Contact details - Appen · 2019-03-06 · number of licensable speech and language resources...

Contact details

Appen Pty Ltd

Level 6, 9 Help Street

Chatswood, Sydney

NSW 2067 Australia

Enquiries:

Sydney office: +61-2-9468-6335

US sales enquiries: +1-315-335-4020

Europe: +31-622-799-535

Japan & Korea: +1-202-765-7106

China: +61-2-9468-6310

sales@appen.com

www.appen.com

Language Resources Catalog

Table of Contents

A global leader in linguistic technology solutions

Speech Databases - Summary

Speech Databases - Detailed

Lexica

Other Language Resources

Appen brings the forefront of speech and language technology to you. We deliver the highest quality in linguistic solutions to government agencies and the world’s largest corporations, with proven expertise in over 150 languages.

We understand the complex linguistic needs of today’s leading organizations. Our unparalleled range of resources and solutions gives you the edge in a wide array of applications, including:

• speech recognition

• text-to-speech synthesis

• speech analytics

• machine translation

• natural language processing

Appen’s reputation as a global leader guarantees you:

• flexibility and rapid response capability

• global coverage in over 150 languages

• highly qualified specialist personnel

• large, closely vetted crowds of in-country native speakers

• tight project management

• keen innovation and creativity

• strict client confidentiality

Appen remains fully independent of any systems provider, although we do enter into close strategic relationships with selected clients. We have been a principal sub-contractor on several European consortium projects, in addition to supporting similar projects funded by DARPA and other US agencies.

Whatever speech and language data you need for your application, Appen will collect it for you.

Our end-to-end data collection service delivers efficiency and quality, even on multiple large-scale collections in parallel.

Available collection types include:

• telephony – fixed-line, mobile, in-car

• embedded device – in-car, desktop, smartphone, tablet

• single/multi-speaker – speakers selected by demographic or other requirements

• prompt variation – scripted, spontaneous, conversational (dialogue), meeting data

• modality – speech, text, handwriting, gesture, image and other acoustic data

• text corpora and other resources – email, SMS, named entity tags, POS tags

As part of a standard collection, we offer you the following:

• detailed linguistic and cultural research

• script preparation and localization

• crowdsourcing of native speakers

• local and remote speech recording

• transcription and annotation of collected data

• quality assurance and project management

• lexicon entries matching database contents

• packaging of database in a coherent format

Data Collection

A global leader in linguistic technology solutions

Appen provides high quality speech and language technology products and services to technology developers and government organizations, and is recognized as a global leader in the quality and coverage of its products and services.

Our products and services cover a wide range of applications in speech recognition, text-to-speech synthesis, phonetic search, machine translation and text processing including Natural Language Processing (NLP).

Appen’s client base includes both government agencies and the world’s largest and most respected IT organizations. Our objective is to enhance our clients’ capabilities in the fields of speech and language technology by offering:

• fast-track production

• tight project management, working to strict timing, quality and productivity criteria

• specialist personnel including highly qualified linguists and computational linguists to support our customers’ internal resources, particularly in response to surge requirements

• flexibility and rapid response capability which may be difficult for larger organizations to achieve

• global coverage which may be difficult for smaller organizations to achieve

• large crowds of in-country native speakers that have passed our screening processes

• high levels of innovation

• strict client confidentiality

Appen remains fully independent of any systems provider, although we can and do enter into close strategic relationships with selected clients. We have been a principal sub-contractor on several European consortium projects, such as SpeeCon (multiple projects); SALA II (multiple projects); LILA (multiple projects); OrienTel and LC-Star. Appen has also supported several DARPA and other US-funded consortium projects.

Appen Catalogue – Speech and Language Resources

Appen has a large number of licensable speech and language resources currently available and in development. Most of the 150+ languages that Appen has worked in are included in off-the-shelf offerings. Up-to-date catalogue information is available at appen.com.

Licensable materials cover:

• Fully transcribed speech databases for broadcast, embedded, in-car and telephony applications

• Pronunciation lexicons to provide both general and domain specific coverage for a given language (specific categories include names, places, natural numbers)

• Part-Of-Speech tagged lexicons and Thesauri to support a wide range of Speech and Language Technology development activities

• Corpora annotated for Part-Of-Speech, Morphological Information, Named Entities

• Parallel Corpora for use in the development of Machine Translation

Appen’s licensable Speech and Language resources offer wide coverage of less commonly taught languages, including languages and dialects of West and North Asia, the Middle East and Africa.

In many cases, licensable resources can be developed on request to meet a particular client’s requirements.

We use AppenScribe, our proprietary web-based transcription interface, to deliver high-volume, high-quality data transcription and annotation to you.

Whether you are working with speech, text, video or handwriting, AppenScribe supports a large number of languages in native orthography.

Our transcription and annotation services include:

• orthographic transcription

• acoustic event transcription

• phonetic and phonemic transcription

• semantic annotation and Named Entity tagging

• annotation of handwriting and other language data

• TTS evaluation through the provision of MOS scores

• time alignment of transcription and acoustic signal

While we have experience in processing millions of US English utterances in a matter of weeks, we are equally practiced in languages like Somali which lack a standardized written form.

We ensure the highest quality of work through:

• screening and training of in-country transcribers

• automated spelling checks

• rigorous post-processing by senior team members

If you need immediate access to a complete speech and language database, Appen has a long list of licensable resources available. See www.appen.com for our latest catalogue.

Our high-quality licensable materials cover:

• fully transcribed speech databases for broadcast, call center, in-car and telephony applications

• pronunciation lexicons, both general and domain-specific (e.g. names, places, natural numbers)

• POS-tagged lexicons and thesauri

• corpora annotated for POS, morphological information and named entities

• parallel corpora for use in the development of machine translation

Appen’s databases also cover less resourced languages, including dialects of West and North Asia, the Middle East and Africa.

Transcription and Annotation

We offer you the collective expertise of our premier network of freelance consultants around the globe, currently covering over 60 languages.

Appen’s team of over 1,000 highly qualified consultants includes:

• linguists, phoneticians and lexicographers

• language specialists with backgrounds in translation, localization, terminology, education and library sciences

• data annotators with experience in Internet research and search evaluation

Among the key benefits we offer to you:

• specialized resources for custom linguistic consulting

• resource pools of language specialists in over 60 countries

• large-scale recruiting and training for rapid market expansion

• on-demand staffing to respond to urgent project changes

Contact us directly for additional information and project-specific enquiries

Appen’s highly trained evaluation teams maximize the relevance of your search engine in over twenty local markets around the world.

Our in-country search experts each review hundreds of queries daily, ranking results for relevance to user input.

Our teams are familiar with search trends, popular and obscure topics, and the linguistic nuances of your search engine’s target users.

In addition to general-purpose search, we also specialize in vertical categories, including:

• local

• news

• medical

• travel

• finance

• shopping

• social

We also provide you with valuable testing of search features, such as:

• spam filtering

• related query suggestion

• duplicate removal

• business listing verification

• caption generation

Search Relevance Evaluation

Human Resourcing and Crowdsourcing

• Afrikaans

• Arabic (15+ varieties)

• Assamese

• Bahasa Indonesia

• Bahasa Malaysia

• Bakhtiari (Iran)

• Basque

• Bengali

• Bulgarian

• Cantonese (China PRC, China Hong Kong)

• Catalan

• Croatian

• Czech

• Danish

• Dari

• Dutch (Netherlands, Belgium)

• English (10+ varieties)

• Estonian

• Farsi

• Finnish

• French (5 varieties)

• Gallego (Galician)

• German (Austrian, German, Luxembourg, Swiss)

• Greek

• Gujarati

• Haitian Creole

• Hausa

• Hebrew

• Hindi

• Hungarian

• Italian

• Japanese

• Kannada

• Kermanji (Iran)

• Korean (North, South)

• Kurdish (Sorani)

• Laki (Iran)

• Latvian

• Lithuanian

• Luri (Iran)

• Malayalam

• Malagasy

• Mandarin (China, Taiwan)

• Marathi

• Mazanderani (Iran)

• Min

• Norwegian (Nynorsk, Bokmål)

• Oriya

• Pashto

• Polish

• Portuguese (Brazilian, European)

• Romanian

• Russian

• Serbian

• Slovak

• Slovenian

• Somali

• Spanish (15+ varieties)

• Swedish

• Sylheti

• Tagalog

• Tamil

• Telugu

• Thai

• Turkish

• Ukrainian

• Urdu

• Vietnamese

• Wu

• Xiang

Languages covered

The list of languages in which Appen works, shown above, is continually expanding. Capability for additional languages can, on request, be developed rapidly.

Database - Summary

Language | DB Name | Database Type | Speakers | Sampling (kHz) | Audio Hrs | List Price
Arabic | CGA_ASR001 | Microphone, Scripted Speech | 150 | 16.00 | 345 | USD 20,000
Arabic (Eastern Algerian) | EAR_ASR001 | Telephony (cell and fixed), Conversational Speech | 496 | 8.00 | 58 | USD 57,500
Arabic English | ENA_ASR001 | Conversational Telephony | 250 | 8.00 | 56 | USD 35,000
Arabic (MSA) | MSA_ASR001 | Microphone, Scripted Speech | 78 | 16.00 | 12 | EUR 3,600
Bahasa Indonesia | BAH_ASR001 | Telephony (cell and fixed), Conversational Speech | 1002 | 8.00 | 63 | USD 45,000
Bengali | BEN_ASR001 | Telephony (cell and fixed), Conversational Speech | 1000 | 8.00 | 94 | USD 45,000
Bulgarian | BUL_ASR001 | Telephony (cell and fixed), Conversational Speech | 217 | 8.00 | 77 | USD 30,000
Bulgarian | BUL_ASR002 | Microphone, Scripted Speech | 77 | 16.00 | 22 | EUR 3,600
Croatian | CRO_ASR001 | Telephony (cell and fixed), Conversational Speech | 200 | 8.00 | 79 | USD 30,000
Croatian | CRO_ASR002 | Microphone, Scripted Speech | 94 | 16.00 | 11 | EUR 3,600
Czech | CZE_ASR001 | Microphone, Scripted Speech | 102 | 16.00 | 31 | EUR 3,600
Dari | DAR_ASR001 | Telephony (cell and fixed), Conversational Speech | 500 | 8.00 | 80 | USD 45,000
Dari | DAR_BRC001 | Broadcast Data | | 0.00 | 40 | USD 22,500
Dutch (Netherlands) | NLD_ASR001 | Telephony (cell and fixed), Conversational Speech | 200 | 8.00 | 73 | USD 30,000
English (Australian) | AUS_ASR001 | Telephony (cell and fixed), Conversational Speech | 500 | 8.00 | 94 | USD 20,000
English (Australian) | AUS_ASR002 | Telephony (cell and fixed), Scripted Speech | 1000 | 8.00 | 120 | USD 31,500
English (Canadian) | ENC_ASR001 | Telephony (cell and fixed), Scripted Speech | 1000 | 8.00 | 144 | USD 37,500
English (Indian) | ENI_ASR001 | Telephony (cell and fixed), Scripted Speech | 2358 | 8.00 | 225 | USD 45,000
Indian English | ENI_ASR002 | Conversational Telephony | 540 | 8.00 | 135 | USD 28,000
English (UK) | UKE_ASR001 | Telephony (cell and fixed), Conversational Speech | 1150 | 8.00 | 102 | USD 45,000
English (UK) | UKE_ASR002 | Voicemail Telephony, Spontaneous Speech | 592 | 8.00 | 69 | USD 37,500
English (US) | USE_ASR001 | Microphone, Scripted Speech | 200 | 48.00 | 124 | USD 15,000
English (US) | USE_ASR002 | Telephony (cell and fixed), Conversational Speech | 20 | 8.00 | 14 | USD 7,500
Farsi/Persian | FAR_ASR001 | Telephony (cell and fixed), Scripted Speech | 789 | 8.00 | 85 | USD 45,000
Farsi/Persian | FAR_ASR002 | Telephony (cell and fixed), Conversational Speech | 1000 | 8.00 | 61 | USD 57,500
Filipino English | ENF_ASR001 | Conversational Telephony | 450 | 8.00 | 107 | USD 35,000
French (Canadian) | FRC_ASR001 | Telephony (cell and fixed), Scripted Speech | 1000 | 8.00 | 131 | USD 37,500
French (Canadian) | FRC_ASR002 | Microphone, Scripted Speech | 120 | 16.00 | 46 | USD 22,500
French (Canadian) | FRC_ASR003 | Telephony (cell and fixed), Conversational Speech | 251 | 8.00 | 20 | USD 31,500
French (European) | FRF_ASR001 | Telephony (cell and fixed), Conversational Speech | 563 | 8.00 | 50 | USD 31,500
French (European) | FRF_ASR002 | Voicemail Telephony, Spontaneous Speech | 560 | 8.00 | 95 | USD 37,500
French (European) | FRF_ASR003 | Microphone, Scripted Speech | 98 | 16.00 | 26 | EUR 3,600
German | DEU_ASR001 | Microphone, Scripted Speech | 127 | 16.00 | 33 | USD 11,500
German | DEU_ASR002 | Voicemail Telephony, Spontaneous Speech | 890 | 8.00 | 65 | USD 37,500
German | DEU_ASR003 | Microphone, Scripted Speech | 77 | 16.00 | 25 | EUR 3,600
Hausa | HAU_ASR001 | Microphone, Scripted Speech | 103 | 16.00 | 20 | EUR 3,600
Hausa | HAU_ASR002 | Telephony (cell), Conversational Speech | 200 | 8.00 | 66 | USD 40,000
Hebrew | HEB_ASR001 | Telephony (cell and fixed), Conversational Speech | 200 | 8.00 | 69 | USD 30,000
Hindi | HIN_ASR001 | Telephony (cell), Scripted Speech | 1920 | 8.00 | 224 | USD 45,000
Hindi | HIN_ASR002 | Telephony (cell and fixed), Conversational Speech | 996 | 8.00 | 65 | USD 45,000
Italian | ITA_ASR001 | Microphone, Scripted Speech | 200 | 22.05 | 177 | USD 12,500
Italian | ITA_ASR002 | Microphone, Scripted Speech | 103 | 48.00 | 189 | USD 19,500
Italian | ITA_ASR003 | Telephony (cell and fixed), Conversational Speech | 200 | 8.00 | 72 | USD 30,000
Italian | ITA_ASR004 | Voicemail Telephony, Spontaneous Speech | 550 | 8.00 | 123 | USD 37,500
Italian | ITA_TTS001 | Microphone, Scripted Speech | 1 | 22.05 | 3 | USD 11,500
Japanese | JPN_ASR001 | Microphone, Scripted Speech | 144 | 16.00 | 33 | EUR 3,600
Kannada | KAN_ASR001 | Telephony (cell and fixed), Conversational Speech | 1000 | 8.00 | 30 | USD 45,000
Korean | KOR_ASR001 | Microphone, Scripted Speech | 100 | 16.00 | 20 | EUR 3,600
Mandarin | MAC_ASR001 | Telephony (cell), Mixed environments | 2000 | 8.00 | 115 | USD 45,000
Mandarin | MAC_ASR002 | Microphone, Scripted Speech | 132 | 16.00 | 26 | EUR 3,600
Marathi | MAR_ASR001 | Telephony (cell and fixed), Conversational Speech | 1000 | 8.00 | 30 | USD 45,000
Pashto | PAS_ASR001 | Telephony (cell and fixed), Conversational Speech | 967 | 8.00 | 111 | USD 65,000
Pashto | PAS_ASR002 | Conversational microphone data | 40 | 16.00 | 80 | USD 75,000
Pashto | PAS_BRC001 | Broadcast Data | | 0.00 | 51 | USD 22,500
Polish | POL_ASR001 | Microphone, Scripted Speech | 99 | 16.00 | 25 | EUR 3,600
Portuguese (Brazilian) | PTB_ASR001 | Microphone, Scripted Speech | 102 | 16.00 | 26 | EUR 3,600
Portuguese (Brazilian) | PTB_ASR002 | Telephony (cell and fixed), Conversational Speech | 200 | 8.00 | 66 | USD 35,000
Portuguese (European) | PTP_ASR001 | Telephony (cell and fixed), Conversational Speech | 200 | 8.00 | 72 | USD 30,000
Romanian | ROM_ASR001 | Telephony (cell and fixed), Conversational Speech | 200 | 8.00 | 74 | USD 30,000
Russian | RUS_ASR001 | Telephony (cell and fixed), Conversational Speech | 200 | 8.00 | 74 | USD 30,000
Russian | RUS_ASR002 | Microphone, Scripted Speech | 115 | 16.00 | 31 | EUR 3,600
Somali | SOM_ASR001 | Telephony (cell and fixed), Conversational Speech | 1000 | 8.00 | 101 | USD 65,000
Sorani (Kurdish) | SOR_ASR001 | Telephony (cell and fixed), Conversational Speech | 170 | 8.00 | 11 | USD 30,000
Spanish (European) | ESP_ASR001 | Microphone, Scripted Speech | 200 | 22.05 | 159 | USD 12,500
Spanish (European) | ESP_ASR002 | Voicemail Telephony, Spontaneous Speech | 512 | 8.00 | 97 | USD 37,500
Spanish (European) | ESP_TTS001 | Microphone, Scripted Speech | 1 | 22.05 | 1 | USD 6,000
Spanish (Latin America) | ESL_ASR001 | Microphone, Scripted Speech | 100 | 16.00 | 17 | EUR 3,600
Swedish | SWE_ASR001 | Microphone, Scripted Speech | 98 | 16.00 | 30 | EUR 3,600
Thai | THA_ASR001 | Microphone, Scripted Speech | 98 | 16.00 | 35 | EUR 3,600
Turkish | TUR_ASR001 | Telephony (cell and fixed), Conversational Speech | 200 | 8.00 | 83 | USD 30,000
Turkish | TUR_ASR002 | Microphone, Scripted Speech | 100 | 16.00 | 17 | EUR 3,600
Urdu | URD_ASR001 | Telephony (cell and fixed), Conversational Speech | 1000 | 8.00 | 95 | USD 45,000
Vietnamese | VIE_ASR001 | Microphone, Scripted Speech | 129 | 16.00 | 47 | EUR 3,600
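
For readers who want to compare summary entries programmatically, the following is a minimal Python sketch, not part of the catalog itself. The `SummaryRow` class and the `price_per_hour` helper are hypothetical names introduced here for illustration. The three sample rows are copied from the summary above, and prices are left in their listed currency without conversion.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SummaryRow:
    """One summary entry (hypothetical structure, for illustration only)."""
    language: str
    db_name: str
    db_type: str
    speakers: Optional[int]  # None where the catalog leaves the value blank
    sampling_khz: float
    audio_hours: float
    currency: str
    list_price: int


def price_per_hour(row: SummaryRow) -> float:
    """List price divided by audio hours, in the row's own currency."""
    return row.list_price / row.audio_hours


rows = [
    SummaryRow("Arabic", "CGA_ASR001", "Microphone, Scripted Speech",
               150, 16.0, 345, "USD", 20_000),
    SummaryRow("Dari", "DAR_BRC001", "Broadcast Data",
               None, 0.0, 40, "USD", 22_500),
    SummaryRow("Czech", "CZE_ASR001", "Microphone, Scripted Speech",
               102, 16.0, 31, "EUR", 3_600),
]

for row in rows:
    print(f"{row.db_name}: {row.currency} {price_per_hour(row):.2f} per audio hour")
```

Price per audio hour is only one possible comparison; speaker count, recording type and sampling rate may matter more for a given application.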

Databases

Language Arabic

DB Name CGA_ASR001

DB type 1 ASR

DB type 2 Microphone

Environments Home/office

Speakers 150

Prompts per speaker 280

Total utterances/Entries 42,000

Audio Hours 345

Sampling rate - kHz 16.00

Recording channels 4

List Price USD 20,000

Brief Description

• This is a 150 speaker microphone recorded database

Language Materials

• Each script elicits approximately 30 minutes of recorded speech

• Each Script includes:

o 30 Person names (first name and family name) from a set of 150

o 10 single isolated digits 0-9

o 10 8-digit sequences (randomly generated)

o 200 Phonetically balanced sentences

o 30 10-word phonetically balanced word strings

Demographics

• 50% of speakers are from the United Arab Emirates

• 50% of speakers are from Saudi Arabia

Transcriptions

• Complete transcriptions of the content of the speech files at a word level

• All acoustic events have been tagged using conventions derived from the SpeechDAT

• All transcriptions fully vowelized

Contact Appen for further information
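
The figures in this entry can be cross-checked with simple arithmetic. The Python sketch below uses variable names chosen here for illustration. It confirms that the script items add up to the stated prompts per speaker, and that speakers multiplied by prompts gives the stated utterance total. The per-speaker duration estimate rests on an assumption the catalog does not state: that the 345 audio hours are summed across all 4 recording channels.

```python
# Script composition as listed in the entry above.
script_items = {
    "person names": 30,
    "single isolated digits": 10,
    "8-digit sequences": 10,
    "phonetically balanced sentences": 200,
    "10-word phonetically balanced word strings": 30,
}

speakers = 150
prompts_per_speaker = sum(script_items.values())
total_utterances = speakers * prompts_per_speaker

# Assumption (not stated in the catalog): audio hours are summed over all channels.
audio_hours = 345
recording_channels = 4
minutes_per_speaker = audio_hours / recording_channels / speakers * 60

print(prompts_per_speaker)            # 280
print(total_utterances)               # 42000
print(round(minutes_per_speaker, 1))  # 34.5
```

Under that assumption, the estimate of about 34.5 minutes per speaker is in line with the stated "approximately 30 minutes of recorded speech" per script.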

Databases

Language Arabic (Eastern Algerian)

DB Name EAR_ASR001

DB type 1 ASR

DB type 2 Conversational Telephony

Speakers 496

Prompts per speaker

Total utterances/Entries

Audio Hours 58

Brief Description

• This is a 496 speaker conversational** telephony database

• Approximately 29 hours of conversation data (equivalent to 58 hours of single channel audio)

• Broad distribution of age, gender and dialects (Algiers and Constantine)

• Database is fully transcribed and timestamped

• Database is accompanied by a pronunciation lexicon containing all transcribed words

** For the majority of calls, both speakers (in-line/out-line) were collected and transcribed. For a smaller number of calls, only one half of the conversation was collected and transcribed
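
The relationship between conversation hours and single-channel audio hours quoted in this entry can be written out directly. This minimal Python sketch assumes every call is recorded on two channels, one per speaker. As the footnote says, some calls have only one side collected, so the result is an approximation. The function name is hypothetical.

```python
def single_channel_hours(conversation_hours: float, channels_per_call: int = 2) -> float:
    """Convert hours of two-party conversation into hours of single-channel audio.

    Assumes each call contributes one audio channel per speaker.
    """
    return conversation_hours * channels_per_call


print(single_channel_hours(29))  # 58
```

The same doubling matches the Bengali entry, where 47 hours of conversation data are described as equivalent to 94 hours of single channel audio.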

Databases

Language Arabic English

DB Name ENA_ASR001

DB type 1 ASR

Environments Low background noise

Unique speakers 250

Average call length 10-15 minutes

Total utterances/Entries N/A

Audio Hours 56

Sampling rate – kHz 8.00

Brief Description

• 115 telephony conversations are recorded for this project

• Demographic information is as follows:

o Roughly equal distribution of male and female

o Broad range of ages from 18 years – 55 years

o Approximately 50% landline/50% mobile

o Speakers speak on a range of generic topics

o Roughly equal distribution of Levantine Arabic and Egyptian Arabic speakers

audio)

Databases

Language Arabic (MSA)

DB Name MSA_ASR001

DB type 1 ASR

Speakers 78

Prompts per speaker

Audio Hours 12

List Price EUR 3,600

Brief Description

• This is a 78 speaker microphone recorded database, Global Phone, developed in collaboration with the Karlsruhe Institute of Technology (KIT)

• Each speaker reads a number of phonetically rich sentences

• The read texts were selected from national newspaper articles available from the web to cover a wide domain with large vocabulary

• Gender 50% male, 50% female

• Broad distribution of speakers in the age group of 18-70 years

• All speakers were recorded in a home/office type environment

• Database is fully transcribed and the transcription is available both in original script and in Romanized form (where applicable)

Databases

Language Bahasa Indonesia

DB Name BAH_ASR001

DB type 1 ASR

Speakers 1,002

Prompts per speaker

Audio Hours 63

Brief Description

• This is a 1,002 speaker conversational** telephony database

audio)

** For a large proportion of calls, only one half of the conversation was collected and transcribed

Databases

Language Bengali

DB Name BEN_ASR001

DB type 1 ASR

Speakers 1,000

Prompts per speaker

Audio Hours 94

Brief Description

• This is a 1,000 speaker conversational telephony database

• Approximately 47 hours of conversation data (equivalent to 94 hours of single channel audio)

Databases

Language Bulgarian

DB Name BUL_ASR001

DB type 1 ASR

Speakers 217

Prompts per speaker

Audio Hours 77

Brief Description

• This is a 200 speaker conversational telephony database

• 200 telephony conversations are recorded for this project - 100 speakers make 2 calls each (1 from landline, 1 from mobile), to a pool of 100 call receivers

channel audio)

Databases

Language Bulgarian

DB Name BUL_ASR002

DB type 1 ASR

Speakers 77

Prompts per speaker

Audio Hours 22

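Brief Description

• This is a 77 speaker microphone recorded database, Global Phone, developed in collaboration with the Karlsruhe Institute of Technology (KIT).

• Database is fully transcribed and the transcription is available both in original script and in Romanized form (where applicable).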

Databases

Language Croatian

DB Name CRO_ASR001

DB type 1 ASR

Speakers 200

Prompts per speaker

Audio Hours 79

Brief Description

• 200 telephony conversations are recorded for this project – 100 speakers make 2 calls

channel audio)

Databases

Language Croatian

DB Name CRO_ASR002

DB type 1 ASR

Speakers 94

Prompts per speaker

Audio Hours 11

Brief Description

Databases

Language Czech

DB Name CZE_ASR001

DB type 1 ASR

Speakers 102

Prompts per speaker

Audio Hours 31

Brief Description

Databases

Language Dari

DB Name DAR_ASR001

DB type 1 ASR

Speakers 500

Prompts per speaker

Audio Hours 80

Brief Description

audio)

• Telephony Distribution

o Landline 13%

o Mobile 87%

Databases

Language Dari

DB Name DAR_BRC001

DB type 1 Broadcast

DB type 2 Broadcast Data

Environments Broadcast Data

Speakers

Prompts per speaker

Audio Hours 40

Brief Description

• Database contains 40 hours of Dari broadcast data

• Database is largely speech only and does not include music or advertisements

• Data types include:

o Talk shows

o Interviews

o News broadcasts (excluding news reading by anchors)

Databases

Language Dutch (Netherlands)

DB Name NLD_ASR001

DB type 1 ASR

Speakers 200

Prompts per speaker

Audio Hours 73

Brief Description

audio)

Databases

Language English (Australian)

DB Name AUS_ASR001

DB type 1 ASR

DB type 2 Telephony

Speakers 500

Audio Hours 94

Brief Description

• This is a 500 speaker telephony database

• 500 speakers (including some migrant representation: Asian (predominantly Chinese), Middle Eastern (predominantly Lebanese), and New Zealand accented English)

• 165 prompts (read speech) per speaker, including:

o Digits

o Natural Numbers

o Letter strings

o Personal, place, and business names

o Confirmation items (yes, no + fuzzy)

o Generic Command and Control items (from a set of 215)

o Phonetically rich Sentences and Words

• Mobile 50%, fixed line 50%

• Age and Gender balanced

• Moderately quiet environments (home/office)

• Total audio length: 94 hours

• Fully transcribed to SpeechDAT type conventions

Databases

Language English (Australian)

DB Name AUS_ASR002

DB type 1 ASR

DB type 2 Telephony

Environments Mixed

Speakers 1,000

Audio Hours 120

Brief Description

• This is a 1,000 speaker Australian English database

• 75 prompts per speaker, including:

o Digits

o Natural Numbers

o Letter strings

o Generic Command and Control items

• The prompts are a mixture of 'read' and 'elicited' items. 5 prompts per script are 'spontaneous free speech'

• Mixture of mobile and landline

• Age and Gender balanced

Databases

Language English (Canadian)

DB Name ENC_ASR001

DB type 1 ASR

DB type 2 Telephony

Environments Mixed

Speakers 1,000

Audio Hours 144

Brief Description

• This is an extended SALA II database.

• 49 prompts per speaker are as specified by the SALA II consortium. An additional 50 prompts (similar content) were recorded by each speaker.

o Digits

o Natural Numbers

o Letter strings

• Mobile telephony recorded in a range of environments including in-car, home/office, roadside and other public place

• Fully transcribed to SALA II/SpeechDAT type conventions

Databases

Language English (Indian)

DB Name ENI_ASR001

DB type 1 ASR

DB type 2 Telephony

Environments Mixed

Speakers 2,358

Audio Hours 225

Brief Description

• This is a 2,358 speaker Indian English mobile telephony speech database recorded on location in India

• Database Type

o Medium Level background noise - in-car, home/office, roadside and other public place type environments

o Total audio length - Approximately 225 hours

• Demographics

o 2,358 speakers recorded in India

o 50% male, 50% female

o Broad distribution of age groups (16-60 years) and dialects

• Language Materials

o 50 prompts per speaker, including Digits; Natural Numbers; Personal, Place, and Business Names; Confirmation items (yes, no + fuzzy); Generic Command and Control items and Phonetically rich Sentences and Words

• Transcription and Lexicon

o Fully transcribed to SpeechDAT type conventions.

o Database is accompanied by a pronunciation lexicon [SAMPA] containing all transcribed words.

o Lexicon - 10,128 unique headwords

Databases

Language Indian English

DB Name ENI_ASR002

DB type 1 ASR

Unique speakers 540

Audio Hours 135

Brief Description

o Dialect distribution:

Eastern India 10%

Northern India 35%

Pakistan 15%

Southern India 20%

Western India 19%

audio).

Databases

Language English (UK)

DB Name UKE_ASR001

DB type 1 ASR

Speakers 1,150

Prompts per speaker

Audio Hours 102

Brief Description

• Provides good coverage of key accents across the UK and Ireland

audio).

• Note: additional data available - please contact Appen for more details

Databases

Language English (UK)

DB Name UKE_ASR002

DB type 1 ASR

DB type 2 Voicemail Telephony

Speakers 592

Prompts per speaker

Audio Hours 69

Brief Description

• This is a 592 speaker voicemail telephony database

• Broad distribution of age, gender and landline/mobile coverage

• Provides good representation of key accents across the United Kingdom

• Approximately 69 audio hours of voicemail data

• The database covers speakers providing spontaneous voicemail type responses selected from a pool of approx. 200 common voicemail scenarios (e.g. Leave a message to tell your colleague that you are running late for a meeting)

• The database is fully transcribed and is accompanied by a pronunciation lexicon containing all transcribed words

Databases

Language English (US)

DB Name USE_ASR001

DB type 1 ASR

DB type 2 Studio/microphone recordings

Environments Studio

Speakers 200

Audio Hours 124

Brief Description

• This is a 200 speaker microphone recorded database

• Each speaker read 400 prompts including:

o Digits

o Natural Numbers

o Personal and City names

o Telephone Numbers

• All speakers were recorded in a studio type environment in the USA

• Database is fully transcribed and is accompanied by a pronunciation lexicon containing all transcribed words

Databases

Language English (US)

DB Name USE_ASR002

DB type 1 ASR

Speakers 20

Prompts per speaker

Audio Hours 14

Brief Description

• Call-Centre style conversations

• Approximately 7 hours of conversation data in total

Databases

Language Farsi/Persian

DB Name FAR_ASR001

DB type 1 ASR

DB type 2 Telephony

Environments Mixed

Speakers 789

Audio Hours 85

Brief Description

• This is a 789 speaker Farsi telephony speech database recorded on location in Iran.

• 50% male, 50% female

• Broad distribution of age groups (16-60 years) and dialects

• Medium Level background noise - in-vehicle, home/office, roadside and other public place type environments

o 48 prompts per speaker, including Digits; Natural Numbers; Letter strings; Personal, Place, and Business names; Confirmation items (yes and no); Generic Command and Control items and Phonetically Rich sentences and words

• Transcriptions

o Fully transcribed to OrienTel type conventions

• Lexicon

transcribed words

• Total audio length - Approximately 85 hours

Databases

Language Farsi/Persian

DB Name FAR_ASR002

DB type 1 ASR

Environments Mixed

Speakers 1,000

Prompts per speaker

Audio Hours 61

Brief Description

audio)

• Database is fully transcribed and time stamped

Databases

Language Filipino English

DB Name ENF_ASR001

DB type 1 ASR

Unique speakers 450

Audio Hours 107

Brief Description

audio).

Databases

Language French (Canadian)

DB Name FRC_ASR001

DB type 1 ASR

DB type 2 Telephony

Environments Mixed

Speakers 1,000

Audio Hours 131

Brief Description

• This is an extended SALA II database

• 48 prompts per speaker are as specified by the SALA II consortium. An additional 52 prompts (similar content) were recorded by each speaker

o Digits

o Natural Numbers

o Letter strings

• Mobile telephony recorded in a range of environments including in-car, home/office, roadside and other public place

• Database is accompanied by a pronunciation lexicon [SAMPA] containing all transcribed words

Databases

Language French (Canadian)

DB Name FRC_ASR002

DB type 1 ASR

DB type 2 Microphone recordings

Speakers 120

Audio Hours 46

Brief Description

• Scripts include:

o Person names

o Digits

o Digit strings (randomly generated)

o Addresses

o Phonetically rich sentences

• Dialects

o 50% Quebecois – Montreal

o 50% Quebecois – Other

Databases

Language French (Canadian)

DB Name FRC_ASR003

DB type 1 ASR

Speakers 251

Prompts per speaker

Audio Hours 20

Brief Description

audio)

** For the majority of calls, both speakers (in-line/out-line) were collected and transcribed. For a small number of calls, only one half of the conversation was collected and transcribed

Databases

Language French (European)

DB Name FRF_ASR001

DB type 1 ASR

Speakers 563

Prompts per speaker

Audio Hours 50

Brief Description

audio).

** For the majority of calls, both speakers (in-line/out-line) were collected and transcribed. For a smaller number of calls, only one half of the conversation was collected and transcribed

Databases

Language French (European)

DB Name FRF_ASR002

DB type 1 ASR

Speakers 560

Prompts per speaker

Audio Hours 95

Brief Description

• Provides good representation of key accents across France

Databases

Language French (European)

DB Name FRF_ASR003

DB type 1 ASR

Speakers 98

Prompts per speaker

Audio Hours 26

Brief Description

Databases

Language German

DB Name DEU_ASR001

DB type 1 ASR

Environments Studio

Speakers 127

Audio Hours 33

Brief Description

• Each speaker read 100 prompts including:

o Digits

o Natural Numbers

o Personal and City names

o Telephone Numbers

• All speakers were recorded in a studio type environment in Germany

• Database is fully transcribed and is accompanied by a pronunciation lexicon containing all transcribed words

Databases

Language German

DB Name DEU_ASR002

DB type 1 ASR

Speakers 890

Prompts per speaker

Audio Hours 65

Recording channels

Brief Description

• This is an 890 speaker voicemail telephony database

• Provides good representation of key accents across Germany

Databases

Language German

DB Name DEU_ASR003

DB type 1 ASR

Speakers 77

Prompts per speaker

Audio Hours 25

Brief Description

• This is a 77 speaker microphone recorded database, Global Phone, developed in collaboration with the Karlsruhe Institute of Technology (KIT).

• Each speaker reads a number of phonetically rich sentences

Databases

Language Hausa

DB Name HAU_ASR001

DB type 1 ASR

Speakers 103

Prompts per speaker

Audio Hours 20

Brief Description

• Each speaker reads a number of phonetically rich sentences

Databases

Language Hausa

DB Name HAU_ASR002

DB type 1 ASR

DB type 2 Conversational telephony

Speakers 200

Prompts per speaker

Audio Hours 66

Brief Description

each, to a pool of 100 call receivers

• channel audio).

Databases

Language Hebrew

DB Name HEB_ASR001

DB type 1 ASR

Speakers 200

Prompts per speaker

Audio Hours 69

Brief Description

• channel audio).

Databases

Language Hindi

DB Name HIN_ASR001

DB type 1 ASR

DB type 2 Telephony

Speakers 1,920

Audio Hours 224

Brief Description

• This is a 1,920 speaker Hindi mobile telephony speech database. The database comprises 1,920 speakers who speak Hindi as a second language (i.e. native speakers of Telugu, Gujarati, etc. who use Hindi as a second language), recorded on location in India

• Database Type

o 1,920 speakers recorded in India

o 50% male, 50% female

o Broad distribution of age groups (16-60 years) and dialects

o Medium Level background noise - in-car, home/office, roadside and other public

place type environments

o 50 prompts per speaker, including Digits; Natural Numbers; Personal, Place and

Business names; Confirmation items (yes, no + fuzzy); Generic Command and

Control items; Phonetically rich Sentences and Words; and Web addresses

• Transcriptions

o Fully transcribed to SpeechDAT type conventions

• Lexicon

transcribed words

o Lexicon - 9,853 unique headwords

o Total audio length - Approximately 224 hours

Databases

Language Hindi

DB Name HIN_ASR002

DB type 1 ASR

Environments Mixed

Speakers 996

Prompts per speaker

Audio Hours 65

Brief Description

channel audio)

** For the majority of calls, both speakers (in-line/out-line) were collected and transcribed.

For a smaller number of calls, only one half of the conversation was collected and

transcribed

Databases

Language Italian

DB Name ITA_ASR001

DB type 1 ASR

Environments Mixed

Speakers 200

Audio Hours 177

Brief Description

• Each speaker read 200 utterances:

o 100 - command and control type items

o 100 - phonetically rich sentences

• Lexicon - 7,316 unique headwords

• Total audio length - 177 hours

Databases

Language Italian

DB Name ITA_ASR002

DB type 1 ASR

Environments In-Car

Speakers 103

Audio Hours 189

Brief Description

• This is a 205 session In-Car database

• Each speaker recorded 1 or 2 sessions:

o Session 1 in a parked vehicle with the engine running

o Session 2 in a vehicle travelling at 60 mph (100 km/h).

• 350 prompts were read by each speaker (175 per session) including:

o Digits

o Street names

Databases

Language Italian

DB Name ITA_ASR003

DB type 1 ASR

Speakers 200

Prompts per speaker

Audio Hours 72

Brief Description

audio)

Databases

Language Italian

DB Name ITA_ASR004

DB type 1 ASR

Speakers 550

Prompts per speaker

Audio Hours 123

Brief Description

• Provides good representation of key accents across Italy

Databases

Language Italian

DB Name ITA_TTS001

DB type 1 TTS

Environments Studio

Speakers 1

Prompts per speaker 3,300

Audio Hours 3

Brief Description

• This is a single speaker TTS speech database. The database comprises 3,300 phonetically

rich sentences recorded by a male Italian speaker in a studio environment. The database is

accompanied by a pronunciation lexicon containing an entry for each of the words spoken

in the database

Databases

Language Japanese

DB Name JPN_ASR001

DB type 1 ASR

Speakers 144

Prompts per speaker

Audio Hours 33

Brief Description

Databases

Language Kannada

DB Name KAN_ASR001

DB type 1 ASR

Environments Mixed

Speakers 1,000

Prompts per speaker

Audio Hours 30

Brief Description

audio).

Databases

Language Korean

DB Name KOR_ASR001

DB type 1 ASR

Speakers 100

Prompts per speaker

Audio Hours 20

Brief Description

Databases

Language Mandarin

DB Name MAC_ASR001

DB type 1 ASR

DB type 2 Telephony

Environments Mixed

Speakers 2,000

Audio Hours 115

Brief Description

• This is a 2,000 speaker Mandarin mobile telephony speech data collection

• The database comprises 2,000 Mandarin speakers recorded on location in China

• 2,000 speakers recorded in China

• 50% male, 50% female

• 100% Mobile Telephony

• Broad distribution of age groups (16-60 years)

• Language Materials

o Digits

o Natural Numbers

• Transcriptions

• Lexicon

• Database is accompanied by a pronunciation lexicon [SAMPA] containing all transcribed

Databases

Language Mandarin

DB Name MAC_ASR002

DB type 1 ASR

Speakers 132

Prompts per speaker

Audio Hours 26

Brief Description

Databases

Language Marathi

DB Name MAR_ASR001

DB type 1 ASR

Environments Mixed

Speakers 1,000

Prompts per speaker

Audio Hours 30

Brief Description

audio).

Databases

Language Pashto

DB Name PAS_ASR001

DB type 1 ASR

Speakers 967

Prompts per speaker

Audio Hours 111

Brief Description

audio).

• For the majority of calls, both speakers (in-line/out-line) were collected and transcribed. For

a smaller number of calls, only one half of the conversation was collected and transcribed

Databases

Language Pashto

DB Name PAS_ASR002

DB type 1 ASR

DB type 2 Conversational microphone data

Number of sessions 40

Average session length 120 minutes

Audio Hours 80

List Price USD 75,000

Brief Description

• Each recording consists of a number of TransTAC style dialogues (monolingual 2-way

conversations). One speaker acts as an interviewer and the other as the interviewee

• The interviewer appears in more than one set of dialogues but the interviewee is unique for

each set

• Data collection scenarios are similar to TransTAC style (e.g. civil affairs, checkpoints etc.)

o Roughly 25% female and 75% male speakers

o Broad distribution across two dialect regions in Afghanistan

• 40 hours of conversation data (equivalent to 80 hours of single channel audio)

• A full translation of the transcripts into French is also available as an optional additional

purchase
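The hour counts in this entry can be cross-checked with simple arithmetic. The sketch below assumes each two-party conversation is stored as two single-channel recordings; the function name `single_channel_hours` is hypothetical. That the 120-minute average session length counts both channels is an inference from the numbers, not something the entry states.

```python
def single_channel_hours(conversation_hours: float, channels: int = 2) -> float:
    """Conversation time counted once per recorded channel."""
    return conversation_hours * channels


# Figures taken from the PAS_ASR002 entry above.
sessions = 40
avg_session_minutes = 120
conversation_hours = 40

# 40 hours of two-party conversation, one channel per speaker.
audio_hours = single_channel_hours(conversation_hours)

# 40 sessions at an average of 120 minutes each, converted to hours.
hours_from_sessions = sessions * avg_session_minutes / 60
```

Both routes give 80 hours, matching the Audio Hours field.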

Databases

Language Pashto

DB Name PAS_BRC001

DB type 1 Broadcast

DB type 2 Broadcast Data

Environments Broadcast Data

Speakers

Prompts per speaker

Audio Hours 51

Brief Description

• Database contains 50 hours of Pashto broadcast data

• Database is largely speech only and does not include music or advertisements

• Data types include:

o Talk shows

o Interviews

o News broadcasts (excluding news reading by anchors)

Databases

Language Polish

DB Name POL_ASR001

DB type 1 ASR

Speakers 99

Prompts per speaker

Audio Hours 25

Brief Description

Databases

Language Portuguese (Brazilian)

DB Name PTB_ASR001

DB type 1 ASR

Speakers 102

Prompts per speaker

Audio Hours 26

Brief Description

Databases

Language Portuguese (Brazilian)

DB Name PTB_ASR002

DB type 1 ASR

Speakers 200

Prompts per speaker

Audio Hours 66

Brief Description

• This is a 200 speaker conversational telephony database (some speakers have participated in up to 2 calls)

audio).

Databases

Language Portuguese (European)

DB Name PTP_ASR001

DB type 1 ASR

Speakers 200

Prompts per speaker

Audio Hours 72

Brief Description

audio).

Databases

Language Romanian

DB Name ROM_ASR001

DB type 1 ASR

Speakers 200

Prompts per speaker

Audio Hours 74

Brief Description

• 200 telephony conversations are recorded for this project – 100 speakers make 2 calls

audio)

Databases

Language Russian

DB Name RUS_ASR001

DB type 1 ASR

Speakers 200

Prompts per speaker

Audio Hours 74

Brief Description

audio).

Databases

Language Russian

DB Name RUS_ASR002

DB type 1 ASR

Speakers 115

Prompts per speaker

Audio Hours 31

Brief Description

Databases

Language Somali

DB Name SOM_ASR001

DB type 1 ASR

Speakers 1,000

Prompts per speaker

Audio Hours 101

Brief Description

audio)

Databases

Language Sorani (Kurdish)

DB Name SOR_ASR001

DB type 1 ASR

Speakers 170

Prompts per speaker

Audio Hours 11

Brief Description

audio).

• For a large proportion of calls, only one half of the conversation was collected and

transcribed

Databases

Language Spanish (European)

DB Name ESP_ASR001

DB type 1 ASR

Environments Mixed

Speakers 200

Audio Hours 159

Brief Description

• Each speaker read 200 utterances:

o 100 - command and control type items

o 100 - phonetically rich sentences

• Lexicon - 6,367 unique headwords

• Total audio length - 159 hours

Databases

DB Name ESP_ASR002

DB type 1 ASR

Speakers 512

Prompts per speaker

Audio Hours 97

Brief Description

• Provides good representation of key accents across Spain

Databases

DB Name ESP_TTS001

DB type 1 TTS

Environments Studio

Speakers 1

Prompts per speaker 1,787

Audio Hours 1

Brief Description

• This is a single speaker TTS speech database. The database comprises 1,786 phonetically

rich sentences recorded by a male Spanish speaker in a studio environment. The database

is accompanied by a pronunciation lexicon containing an entry for each of the words

spoken in the database

Databases

Language Spanish (Latin America)

DB Name ESL_ASR001

DB type 1 ASR

Speakers 100

Prompts per speaker

Audio Hours 17

Brief Description

Databases

Language Swedish

DB Name SWE_ASR001

DB type 1 ASR

Speakers 98

Prompts per speaker

Audio Hours 30

Brief Description

Databases

Language Thai

DB Name THA_ASR001

DB type 1 ASR

Speakers 98

Prompts per speaker

Audio Hours 35

Brief Description

Databases

Language Turkish

DB Name TUR_ASR001

DB type 1 ASR

Speakers 200

Prompts per speaker

Audio Hours 83

Brief Description

audio).

Databases

Language Turkish

DB Name TUR_ASR002

DB type 1 ASR

Speakers 100

Prompts per speaker

Audio Hours 17

Brief Description

Databases

Language Urdu

DB Name URD_ASR001

DB type 1 ASR

Environments Mixed

Speakers 1,000

Prompts per speaker

Audio Hours 95

Brief Description

• This is a 1,000 speaker conversational telephony database recorded by native Urdu

speakers in Pakistan (700 speakers) and India (300 speakers)

audio).

Databases

Language Vietnamese

DB Name VIE_ASR001

DB type 1 ASR

Speakers 129

Prompts per speaker

Audio Hours 47

Brief Description

Lexica

Overview

Appen Butler Hill has considerable experience in providing a variety of lexicon types. These include:

• Pronunciation Lexica providing phonemic representation, syllabification, and stress (primary and secondary as appropriate)

• Part-of-speech tagged Lexica providing grammatical and semantic labels

• Other reference text based materials including spelling/mis-spelling lists, spell-check dictionaries, mappings of colloquial language to standard forms, orthographic normalisation lists.

Over a period of 15 years, Appen Butler Hill has generated a significant volume of licensable material for a wide range of languages (please see language list below).

Domain Coverage

Typical domains covered in our off-the-shelf holdings for a given language include:

• General Vocabulary

• Geographical Names e.g. Place Names (City, State, Suburb)

• Numbers (0-10,000)

• Person Names (both Given and Family)

Lexica can be developed from a wordlist provided by the client or by Appen Butler Hill. If a client requires vocabulary of a specific nature or to cover a specific domain, this can typically be provided under the same license and pricing terms as our pre-existing (off-the-shelf) holdings.

Lexicon Structure

• Our Lexica are usually created using a SAMPA phone set for the language which aligns SAMPA symbols with IPA equivalents. We can convert to most other machine readable formats on request

• We also include documentation files which include phone set definitions, statistical notes about phone coverage within a given Lexicon, and may include background information on data quality and validation.

Lexica are typically delivered as text files consisting of three or four tab-delimited fields:

Field 1 - Headword

Field 2 - SAMPA pronunciation

Field 3 - Variant Rank (0 = preferred pronunciation; 1 = also heard, less common)

Field 4 - Label e.g. (FAMILY_NAME, GIVEN_NAME, COMMON_WORD…etc.)

In addition to the phonemic mark-up, our Lexica are marked up for primary and secondary stress and for syllabification where applicable. They will also include pronunciation variants where relevant.
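The tab-delimited layout described above can be read with a few lines of Python. This is a minimal sketch, not Appen tooling: the `LexiconEntry` class, the `parse_lexicon_line` function and the sample rows (including their SAMPA strings) are hypothetical illustrations. It assumes the fourth field is simply absent in three-field files.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LexiconEntry:
    headword: str
    pronunciation: str           # SAMPA string, kept verbatim
    variant_rank: int            # 0 = preferred, 1 = also heard, less common
    label: Optional[str] = None  # e.g. FAMILY_NAME; None in three-field files


def parse_lexicon_line(line: str) -> LexiconEntry:
    """Parse one tab-delimited lexicon line with three or four fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) not in (3, 4):
        raise ValueError(f"expected 3 or 4 tab-delimited fields, got {len(fields)}")
    label = fields[3] if len(fields) == 4 else None
    return LexiconEntry(fields[0], fields[1], int(fields[2]), label)


# Hypothetical sample rows, for illustration only.
sample = [
    'Smith\tsmIT\t0\tFAMILY_NAME',
    'either\t"i:.D@\t0',
    'either\t"aI.D@\t1',
]
entries = [parse_lexicon_line(row) for row in sample]

# Keep only the preferred pronunciation of each headword.
preferred = [e for e in entries if e.variant_rank == 0]
```

Filtering on the variant rank, as in the last line, is one way to reduce a lexicon to a single pronunciation per headword when variants are not needed.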

License price per headword, by Lexicon Category:

Category 1 - Most languages using Latin based orthographies: USD 0.335

Category 2 - Languages requiring tone mark-up (e.g. Mandarin, Cantonese) and languages requiring multiple representational forms in the orthography (e.g. Japanese): USD 0.415

Category 3 - Languages requiring full diacritization/vowelization (e.g. Arabic): USD 0.460

Pricing for specialized Languages and Part-of-Speech Tagged Lexica can be provided on request.
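As a rough illustration of how the per-headword prices scale, the sketch below multiplies a headword count by the category price. It assumes strictly linear pricing with no minimums or volume discounts, which the catalog does not confirm; the function name `estimate_license_cost` and the order size are hypothetical.

```python
from decimal import Decimal

# Per-headword license prices (USD) copied from the price list above.
PRICE_PER_HEADWORD = {
    1: Decimal("0.335"),  # most languages using Latin based orthographies
    2: Decimal("0.415"),  # tone mark-up or multiple orthographic forms
    3: Decimal("0.460"),  # full diacritization/vowelization
}


def estimate_license_cost(category: int, headwords: int) -> Decimal:
    """Return headwords x per-headword price, rounded to cents."""
    if category not in PRICE_PER_HEADWORD:
        raise ValueError(f"unknown lexicon category: {category}")
    if headwords < 0:
        raise ValueError("headwords must be non-negative")
    return (PRICE_PER_HEADWORD[category] * headwords).quantize(Decimal("0.01"))


# Hypothetical order: 10,000 headwords in category 2.
cost = estimate_license_cost(2, 10_000)
```

`Decimal` is used instead of floats so that the three-decimal prices multiply exactly.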


Lexica

Number of headwords

New offerings are frequently added. For holdings information in a given language or to discuss any

customized development efforts, please contact:

sales@appen.com

appen.com

[Bar chart: number of headwords per language, axis from 0 to 50,000. Languages shown: English (Canadian), English (Australian), Danish, Croatian, Catalan, Cantonese, Bulgarian, Bengali, Basque, Bahasa Malay, Bahasa Indonesia, Arabic (UAE), Arabic (Syrian), Arabic (South Levantine), Arabic (Palestinian), Arabic (North Levantine), Arabic (MSA), Arabic (Maghrebi), Arabic (Iraqi), Arabic (Gulf), Arabic (Egyptian), Arabic (Algerian), Assamese. Off-scale bar labels (language not indicated): >75,000, >55,000, >100,000, >110,000, >75,000, >70,000.]


[Bar chart: number of headwords per language, axis from 0 to 50,000. Languages shown: Norwegian, Marathi, Mandarin, Malayalam, Korean, Kannada, Japanese, Italian, Hungarian, Hebrew, German (Switzerland), German (Austria), German, French (Switzerland), French (Luxembourg), French (European), French (Canadian), French (Belgian), Finnish, English (US), English (UK), English (New Zealand), English (Indian). Off-scale bar labels (language not indicated): >155,000, >85,000, >60,000, >110,000, >55,000, >190,000, >260,000, >100,000, >115,000, >200,000.]


[Bar chart: number of headwords per language, axis from 0 to 50,000. Languages shown: Vietnamese, Ukrainian, Turkish, Telugu, Tagalog, Sylheti, Swedish, Swahili (Kenya), Spanish (Mexican), Spanish (EU - Castilian), Spanish (American - US), Spanish (All Latin America), Sorani (Kurdish), Somali, Serbian, Russian, Romanian, Portuguese (EU), Portuguese (Brazil), Polish, Persian/Farsi, Pashto. Off-scale bar labels (language not indicated): >250,000, >100,000, >90,000, >115,000, >100,000, >50,000, >100,000.]

Other Language Resources

Apart from speech databases and lexica, Appen also has a range of other language resources available for license, which can be found in this section. These resources include:

1. Text Corpora — We have a wide variety of text collections in different languages

available for license. Apart from the Vowelized Arabic Corpus, Appen also has a range of

Named Entity annotated texts. These are corpora of 500,000 words of news text that have

been annotated for persons, titles, quantities, geopolitical entities, locations, facilities, etc.

2. Morphological Analyzers — Our morphological analyzers are designed to generate

grammatically acceptable words using tagged stem dictionaries and information on

inflectional affixes and their combinations. They can manipulate text from languages with

non-Latin scripts and currently generate Urdu and Persian, including informal written

variants of affixes.

3. Thesaurus — Appen can undertake thesaurus development in several ways: from first

principles, as an extension to existing work or as validation of an existing thesaurus, with

consistency and coverage an important focus. Because each language is subtly different

and requires deep grammatical analysis to produce a quality product, native speakers are

always used to build a thesaurus. Appen can produce thesauri as a licensable database,

supplied either in a standard XML format or to client specifications.

4. Language Analysis Documentation — Appen can provide comprehensive language

analysis documents under license for all languages of interest. These documents support

system and application developers and include phonological features and processes,

analysis of Romanization schemes (where applicable), regional and dialectal differences and

population statistics of speakers. Appen can also provide analysis and recommendations

on specific collections for a nominated language.

Language Analysis Documents

List Price: USD 2,500 (per language)

Brief Description: The key topics that are typically covered in the language analysis document include:

• General Information about the country

• General Information about the language

• Language classification of the language

• Other Languages spoken in the country

• History of the language (where relevant) including changes due to immigration etc

• Dialects of the language

o maps indicating dialect regions

o discussion of dialects – distribution, features etc.

o recommendations on a dialect distribution that would be feasible to use in a speech data collection

• Sound System of the language

• Relevant Phonological Processes prevalent in the language/country

• Orthographic Conventions for the language

• Communications

Language and DB Name:

Arabic (Iraqi) ARB_LAN001

Arabic (North Levantine) ARB_LAN002

Bahasa Indonesia BAH_LAN001

Brazilian Portuguese PTB_LAN001

Croatian CRO_LAN001

Dari DAR_LAN001

English (US) ENG_LAN001

Farsi/Persian FAR_LAN001

French (Canadian) FRC_LAN001

German DEU_LAN001

Hebrew HEB_LAN001

Japanese JAP_LAN001

Korean KOR_LAN001

Mandarin MAC_LAN001

Pashto PAS_LAN001

Russian RUS_LAN001

Serbian SRB_LAN001

Sorani (Kurdish) SOR_LAN001

Thai THA_LAN001

Urdu URD_LAN001

NER Corpora

Words: 500,000 (per language)

List Price: USD 7,500 (per language)

Brief Description: Corpora containing text material collected from a variety of sources. Each Text Corpus contains approximately 500,000 words and is tagged for the following Named Entities:

- Person

- Organization

- Location

- Nationality

- Religion

- Facility

- Geo-Political Entity

- Titles

Language and DB Name:

Arabic ARB_NER001

English ENG_NER001

Farsi/Persian FAR_NER001

Japanese JPY_NER001

Korean KOR_NER001

Mandarin MAC_NER001

Russian RUS_NER001

Urdu URD_NER001

Text Corpora

Language Arabic (MSA)

DB Name ARB_THE001

DB type 2 Thesaurus

Words 28,000

List Price Provided on request

Brief Description:

• The thesaurus contains 28,000 headwords

• For each headword, the following information is provided:

o Detailed Part-Of-Speech information including Verb (Intransitive/Transitive), Adverb, Noun, Adjective

o A broad definition in English

o Synonyms

o Antonyms

o A broad definition of the antonym group linked to the sense group
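The per-headword information listed above maps naturally onto a small record type. The sketch below is only an illustration of that shape: the `ThesaurusEntry` class, its field names and the sample values are hypothetical and do not reflect the delivered file format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ThesaurusEntry:
    headword: str
    part_of_speech: str   # e.g. "Verb (Transitive)", "Noun", "Adjective"
    definition_en: str    # broad definition in English
    synonyms: List[str] = field(default_factory=list)
    antonyms: List[str] = field(default_factory=list)
    antonym_group_definition: str = ""  # broad definition of the antonym group


# Hypothetical entry with romanized forms, for illustration only.
entry = ThesaurusEntry(
    headword="kabir",
    part_of_speech="Adjective",
    definition_en="large in size",
    synonyms=["dakhm"],
    antonyms=["saghir"],
    antonym_group_definition="small in size",
)
```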

Text Corpora

Language Arabic (MSA)

DB Name ARB_TXT001

DB type 2 Vowelized text corpus

Words 450,000

Brief Description:

• This vowelised corpus is made up of 450,000 words of Arabic news text

• The text has been 100% manually vowelised and checked

Text Corpora

Language Farsi/Persian

DB Name FAR_MOR001

DB type 2 Morphological Database

Words 0

Brief Description:

• The Farsi/Persian morphological database comprises six files in text format:

o a stems dictionary;

o a dictionary of inflectional prefixes;

o a dictionary of inflectional suffixes; and

o three compatibility tables, which define the grammatically acceptable combinations of stems, prefixes and suffixes for any given stem in the stems dictionary (prefix-suffix; prefix-stem; suffix-stem).

• The format of the six files corresponds to the input format required by the Buckwalter

AraGen generation program. This program uses the input file to output the complete set of

potential words defined by the stem and affix dictionaries and compatibility tables

• All words and affixes in the six files are in a Romanized form (converted using an Appen

conversion table). Each word and affix is shown with and without short vowels. The form

with short vowels (the vowelized form) reflects the pronunciation of the word or affix

• SUMMARY OF CONTENTS

o Stems in stem dictionary - 18,364 (including stem alternations)

o Stems in stem dictionary - 16,492 (excluding stem alternations)

o Number of suffixes: 506 (including zero suffix and variants of suffixes with and without the zero width non-joiner character)

o Number of prefixes: 14 (including zero prefix)

o Number of unique words generated: 1,608,559
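The generation step described above (stem and affix dictionaries filtered by three compatibility tables) can be sketched in a few lines. This is not the Buckwalter AraGen program and does not use its file format. The category labels and toy entries are hypothetical, and the sketch assumes each dictionary entry carries a category and each compatibility table is a set of allowed category pairs.

```python
from itertools import product

# Toy dictionaries of (surface form, category). "" stands for the zero affix.
prefixes = [("", "P0"), ("mi", "P_IMPF")]
stems = [("rav", "V_PRES"), ("ketab", "N")]
suffixes = [("", "S0"), ("am", "S_1SG"), ("ha", "S_PL")]

# Three compatibility tables, each a set of allowed category pairs.
prefix_stem = {("P0", "V_PRES"), ("P0", "N"), ("P_IMPF", "V_PRES")}
suffix_stem = {("S0", "N"), ("S_PL", "N"), ("S_1SG", "V_PRES")}  # (suffix, stem)
prefix_suffix = {("P0", "S0"), ("P0", "S_PL"), ("P0", "S_1SG"), ("P_IMPF", "S_1SG")}


def generate_words():
    """Return the sorted set of words whose three pairings are all allowed."""
    words = set()
    for (p, pc), (s, sc), (x, xc) in product(prefixes, stems, suffixes):
        if ((pc, sc) in prefix_stem
                and (xc, sc) in suffix_stem
                and (pc, xc) in prefix_suffix):
            words.add(p + s + x)
    return sorted(words)


words = generate_words()
```

With these toy tables, 4 of the 12 possible prefix-stem-suffix combinations pass all three checks. The same filtering idea explains how a few hundred affixes and several thousand stems can expand into a much larger set of unique words.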

Text Corpora

Language Urdu

DB Name URD_MOR001

DB type 2 Morphological Database

Words 0

Brief Description:

• The Urdu morphological database comprises six files in text format:

o a stems dictionary;

o a dictionary of inflectional prefixes;

o a dictionary of inflectional suffixes; and

o three compatibility tables, which define the grammatically acceptable combinations of stems, prefixes and suffixes for any given stem in the stems dictionary (prefix-suffix; prefix-stem; suffix-stem)

• The format of the six files corresponds to the input format required by the Buckwalter

AraGen generation program. This program uses the input file to output the complete set of

potential words defined by the stem and affix dictionaries and compatibility tables.

• All words and affixes in the six files are in a Romanized form (converted using an Appen

conversion table). Each word and affix is shown with and without short vowels. The form

with short vowels (the vowelized form) reflects the pronunciation of the word or affix.

• SUMMARY OF CONTENTS

o Stems in stem dictionary - 13,267 (including stem alternations)

o Stems in stem dictionary - 13,116 (excluding stem alternations)

o Number of suffixes: 115 (including zero suffix)

o Number of prefixes: 1 (zero prefix)

o Number of unique words generated: 31,109
