Post on 03-Jan-2020
Contact details
Appen Pty Ltd
Level 6, 9 Help Street
Chatswood, Sydney NSW 2067 Australia
Enquiries:
Sydney office: +61-2-9468-6335
US sales enquiries: +1-315-335-4020
Europe: +31-622-799-535
Japan & Korea: +1-202-765-7106
China: +61-2-9468-6310
sales@appen.com
www.appen.com
Language Resources Catalog
Table of Contents
A global leader in linguistic technology solutions
Speech Databases - Summary
Speech Databases - Detailed
Lexica
Other Language Resources
Appen brings the forefront of speech and language technology to you. We deliver the highest quality in linguistic solutions to government agencies and the world’s largest corporations, with proven expertise in over 150 languages.
We understand the complex linguistic needs of today’s leading organizations. Our unparalleled range of resources and solutions gives you the edge in a wide array of applications, including:
• speech recognition
• text-to-speech synthesis
• speech analytics
• machine translation
• natural language processing
Appen’s reputation as a global leader guarantees you:
• flexibility and rapid response capability
• global coverage in over 150 languages
• highly qualified specialist personnel
• large, closely vetted crowds of in-country native speakers
• tight project management
• keen innovation and creativity
• strict client confidentiality
Appen remains fully independent of any systems provider, although we do enter into close strategic relationships with selected clients. We have been a principal sub-contractor on several European consortium projects, in addition to supporting similar projects funded by DARPA and other US agencies.
Data Collection
Whatever speech and language data you need for your application, Appen will collect it for you.
Our end-to-end data collection service delivers efficiency and quality, even on multiple large-scale collections in parallel.
Available collection types include:
• telephony – fixed-line, mobile, in-car
• embedded device – in-car, desktop, smartphone, tablet
• single/multi-speaker – speakers selected by demographic or other requirements
• prompt variation – scripted, spontaneous, conversational (dialogue), meeting data
• modality – speech, text, handwriting, gesture, image and other acoustic data
• text corpora and other resources – email, SMS, named entity tags, POS tags
As part of a standard collection, we offer you the following:
• detailed linguistic and cultural research
• script preparation and localization
• crowdsourcing of native speakers
• local and remote speech recording
• transcription and annotation of collected data
• quality assurance and project management
• lexicon entries matching database contents
• packaging of database in a coherent format
A global leader in linguistic technology solutions
Appen provides high quality speech and language technology products and services to technology developers and government organizations, and is recognized as a global leader in the quality and coverage of its products and services.
Our products and services cover a wide range of applications in speech recognition, text-to-speech synthesis, phonetic search, machine translation and text processing including Natural Language Processing (NLP).
Appen’s client base includes both government agencies and the world’s largest and most respected IT organizations. Our objective is to enhance our clients’ capabilities in the fields of speech and language technology by offering:
• fast-track production
• tight project management, working to strict timing, quality and productivity criteria
• specialist personnel including highly qualified linguists and computational linguists to support our customers’ internal resources, particularly in response to surge requirements
• flexibility and rapid response capability which may be difficult for larger organizations to achieve
• global coverage which may be difficult for smaller organizations to achieve
• large crowds of in-country native speakers that have passed our screening processes
• high levels of innovation
• strict client confidentiality
Appen remains fully independent of any systems provider, although we can and do enter into close strategic relationships with selected clients. We have been a principal sub-contractor on several European consortium projects, such as SpeeCon (multiple projects); SALA II (multiple projects); LILA (multiple projects); Orientel and LC-Star. Appen has also supported several DARPA and other US-funded consortium projects.
Appen Catalogue – Speech and Language Resources
Appen has a large number of licensable speech and language resources currently available and in development. Most of the 150+ languages that Appen has worked in are included in off-the-shelf offerings. Up-to-date catalogue information is available at appen.com.
Licensable materials cover:
• Fully transcribed speech databases for broadcast, embedded, in-car and telephony applications
• Pronunciation lexicons to provide both general and domain specific coverage for a given language (specific categories include names, places, natural numbers)
• Part-Of-Speech tagged lexicons and Thesauri to support a wide range of Speech and Language Technology development activities
• Corpora annotated for Part-Of-Speech, Morphological Information, Named Entities
• Parallel Corpora for use in the development of Machine Translation
Appen’s licensable Speech and Language resources offer wide coverage of less commonly taught languages, including languages and dialects of West and North Asia, the Middle East and Africa.
In many cases, licensable resources can be developed on request to meet a particular client’s requirements.
Transcription and Annotation
We use AppenScribe, our proprietary web-based transcription interface, to deliver high-volume, high-quality data transcription and annotation to you.
Whether you are working with speech, text, video or handwriting, AppenScribe supports a large number of languages in native orthography.
Our transcription and annotation services include:
• orthographic transcription
• acoustic event transcription
• phonetic and phonemic transcription
• semantic annotation and Named Entity tagging
• annotation of handwriting and other language data
• TTS evaluation through the provision of MOS scores
• time alignment of transcription and acoustic signal
While we have experience in processing millions of US English utterances in a matter of weeks, we are equally practiced in languages like Somali which lack a standardized written form.
We ensure the highest quality of work through:
• screening and training of in-country transcribers
• automated spelling checks
• rigorous post-processing by senior team members
Appen Catalogue – Speech and Language Resources
If you need immediate access to a complete speech and language database, Appen has a long list of licensable resources available. See www.appen.com for our latest catalogue.
Our high-quality licensable materials cover:
• fully transcribed speech databases for broadcast, call center, in-car and telephony applications
• pronunciation lexicons, both general and domain-specific (e.g. names, places, natural numbers)
• POS-tagged lexicons and thesauri
• corpora annotated for POS, morphological information and named entities
• parallel corpora for use in the development of machine translation
Appen’s databases also cover less resourced languages, including dialects of West and North Asia, the Middle East and Africa.
Human Resourcing and Crowdsourcing
We offer you the collective expertise of our premier network of freelance consultants around the globe, currently covering over 60 languages.
Appen’s team of over 1,000 highly qualified consultants includes:
• linguists, phoneticians and lexicographers
• language specialists with backgrounds in translation, localization, terminology, education and library sciences
• data annotators with experience in Internet research and search evaluation
Among the key benefits we offer to you:
• specialized resources for custom linguistic consulting
• resource pools of language specialists in over 60 countries
• large-scale recruiting and training for rapid market expansion
• on-demand staffing to respond to urgent project changes
Contact us directly for additional information and project-specific enquiries.
Search Relevance Evaluation
Appen’s highly trained evaluation teams maximize the relevance of your search engine in over twenty local markets around the world.
Our in-country search experts each review hundreds of queries daily, ranking results for relevance to user input.
Our teams are familiar with search trends, popular and obscure topics, and the linguistic nuances of your search engine’s target users.
In addition to general-purpose search, we also specialize in vertical categories, including:
• local
• news
• medical
• travel
• finance
• shopping
• social
We also provide you with valuable testing of search features, such as:
• spam filtering
• related query suggestion
• duplicate removal
• business listing verification
• caption generation
Languages covered
The list of languages in which Appen works is continually expanding, and includes:
• Afrikaans
• Arabic (15+ varieties)
• Assamese
• Bahasa Indonesia
• Bahasa Malaysia
• Bakhtiari (Iran)
• Basque
• Bengali
• Bulgarian
• Cantonese (China PRC, China Hong Kong)
• Catalan
• Croatian
• Czech
• Danish
• Dari
• Dutch (Netherlands, Belgium)
• English (10+ varieties)
• Estonian
• Farsi
• Finnish
• French (5 varieties)
• Gallego (Galician)
• German (Austrian, German, Luxembourg, Swiss)
• Greek
• Gujarati
• Haitian Creole
• Hausa
• Hebrew
• Hindi
• Hungarian
• Italian
• Japanese
• Kannada
• Kermanji (Iran)
• Korean (North, South)
• Kurdish (Sorani)
• Laki (Iran)
• Latvian
• Lithuanian
• Luri (Iran)
• Malayalam
• Malagasy
• Mandarin (China, Taiwan)
• Marathi
• Mazanderani (Iran)
• Min
• Norwegian (Nynorsk, Bokmal)
• Oriya
• Pashto
• Polish
• Portuguese (Brazilian, European)
• Romanian
• Russian
• Serbian
• Slovak
• Slovenian
• Somali
• Spanish (15+ varieties)
• Swedish
• Sylheti
• Tagalog
• Tamil
• Telugu
• Thai
• Turkish
• Ukrainian
• Urdu
• Vietnamese
• Wu
• Xiang
Capability for additional languages can, on request, be developed rapidly.
Databases - Summary
Language | Name | Database Type | Speakers | Sampling (kHz) | Audio Hrs | Price
Arabic | CGA_ASR001 | Microphone, Scripted Speech | 150 | 16.00 | 345 | USD 20,000
Arabic (Eastern Algerian) | EAR_ASR001 | Telephony (cell and fixed), Conversational Speech | 496 | 8.00 | 58 | USD 57,500
Arabic English | ENA_ASR001 | Conversational Telephony | 250 | 8.00 | 56 | USD 35,000
Arabic (MSA) | MSA_ASR001 | Microphone, Scripted Speech | 78 | 16.00 | 12 | EUR 3,600
Bahasa Indonesia | BAH_ASR001 | Telephony (cell and fixed), Conversational Speech | 1002 | 8.00 | 63 | USD 45,000
Bengali | BEN_ASR001 | Telephony (cell and fixed), Conversational Speech | 1000 | 8.00 | 94 | USD 45,000
Bulgarian | BUL_ASR001 | Telephony (cell and fixed), Conversational Speech | 217 | 8.00 | 77 | USD 30,000
Bulgarian | BUL_ASR002 | Microphone, Scripted Speech | 77 | 16.00 | 22 | EUR 3,600
Croatian | CRO_ASR001 | Telephony (cell and fixed), Conversational Speech | 200 | 8.00 | 79 | USD 30,000
Croatian | CRO_ASR002 | Microphone, Scripted Speech | 94 | 16.00 | 11 | EUR 3,600
Czech | CZE_ASR001 | Microphone, Scripted Speech | 102 | 16.00 | 31 | EUR 3,600
Dari | DAR_ASR001 | Telephony (cell and fixed), Conversational Speech | 500 | 8.00 | 80 | USD 45,000
Dari | DAR_BRC001 | Broadcast Data | | 0.00 | 40 | USD 22,500
Dutch (Netherlands) | NLD_ASR001 | Telephony (cell and fixed), Conversational Speech | 200 | 8.00 | 73 | USD 30,000
English (Australian) | AUS_ASR001 | Telephony (cell and fixed), Conversational Speech | 500 | 8.00 | 94 | USD 20,000
English (Australian) | AUS_ASR002 | Telephony (cell and fixed), Scripted Speech | 1000 | 8.00 | 120 | USD 31,500
English (Canadian) | ENC_ASR001 | Telephony (cell and fixed), Scripted Speech | 1000 | 8.00 | 144 | USD 37,500
English (Indian) | ENI_ASR001 | Telephony (cell and fixed), Scripted Speech | 2358 | 8.00 | 225 | USD 45,000
Indian English | ENI_ASR002 | Conversational Telephony | 540 | 8.00 | 135 | USD 28,000
English (UK) | UKE_ASR001 | Telephony (cell and fixed), Conversational Speech | 1150 | 8.00 | 102 | USD 45,000
English (UK) | UKE_ASR002 | Voicemail Telephony, Spontaneous Speech | 592 | 8.00 | 69 | USD 37,500
English (US) | USE_ASR001 | Microphone, Scripted Speech | 200 | 48.00 | 124 | USD 15,000
English (US) | USE_ASR002 | Telephony (cell and fixed), Conversational Speech | 20 | 8.00 | 14 | USD 7,500
Farsi/Persian | FAR_ASR001 | Telephony (cell and fixed), Scripted Speech | 789 | 8.00 | 85 | USD 45,000
Farsi/Persian | FAR_ASR002 | Telephony (cell and fixed), Conversational Speech | 1000 | 8.00 | 61 | USD 57,500
Filipino English | ENF_ASR001 | Conversational Telephony | 450 | 8.00 | 107 | USD 35,000
French (Canadian) | FRC_ASR001 | Telephony (cell and fixed), Scripted Speech | 1000 | 8.00 | 131 | USD 37,500
French (Canadian) | FRC_ASR002 | Microphone, Scripted Speech | 120 | 16.00 | 46 | USD 22,500
French (Canadian) | FRC_ASR003 | Telephony (cell and fixed), Conversational Speech | 251 | 8.00 | 20 | USD 31,500
French (European) | FRF_ASR001 | Telephony (cell and fixed), Conversational Speech | 563 | 8.00 | 50 | USD 31,500
French (European) | FRF_ASR002 | Voicemail Telephony, Spontaneous Speech | 560 | 8.00 | 95 | USD 37,500
French (European) | FRF_ASR003 | Microphone, Scripted Speech | 98 | 16.00 | 26 | EUR 3,600
German | DEU_ASR001 | Microphone, Scripted Speech | 127 | 16.00 | 33 | USD 11,500
German | DEU_ASR002 | Voicemail Telephony, Spontaneous Speech | 890 | 8.00 | 65 | USD 37,500
German | DEU_ASR003 | Microphone, Scripted Speech | 77 | 16.00 | 25 | EUR 3,600
Hausa | HAU_ASR001 | Microphone, Scripted Speech | 103 | 16.00 | 20 | EUR 3,600
Hausa | HAU_ASR002 | Telephony (cell), Conversational Speech | 200 | 8.00 | 66 | USD 40,000
Hebrew | HEB_ASR001 | Telephony (cell and fixed), Conversational Speech | 200 | 8.00 | 69 | USD 30,000
Hindi | HIN_ASR001 | Telephony (cell), Scripted Speech | 1920 | 8.00 | 224 | USD 45,000
Hindi | HIN_ASR002 | Telephony (cell and fixed), Conversational Speech | 996 | 8.00 | 65 | USD 45,000
Italian | ITA_ASR001 | Microphone, Scripted Speech | 200 | 22.05 | 177 | USD 12,500
Italian | ITA_ASR002 | Microphone, Scripted Speech | 103 | 48.00 | 189 | USD 19,500
Italian | ITA_ASR003 | Telephony (cell and fixed), Conversational Speech | 200 | 8.00 | 72 | USD 30,000
Italian | ITA_ASR004 | Voicemail Telephony, Spontaneous Speech | 550 | 8.00 | 123 | USD 37,500
Italian | ITA_TTS001 | Microphone, Scripted Speech | 1 | 22.05 | 3 | USD 11,500
Japanese | JPN_ASR001 | Microphone, Scripted Speech | 144 | 16.00 | 33 | EUR 3,600
Kannada | KAN_ASR001 | Telephony (cell and fixed), Conversational Speech | 1000 | 8.00 | 30 | USD 45,000
Korean | KOR_ASR001 | Microphone, Scripted Speech | 100 | 16.00 | 20 | EUR 3,600
Mandarin | MAC_ASR001 | Telephony (cell), Mixed environments | 2000 | 8.00 | 115 | USD 45,000
Mandarin | MAC_ASR002 | Microphone, Scripted Speech | 132 | 16.00 | 26 | EUR 3,600
Marathi | MAR_ASR001 | Telephony (cell and fixed), Conversational Speech | 1000 | 8.00 | 30 | USD 45,000
Pashto | PAS_ASR001 | Telephony (cell and fixed), Conversational Speech | 967 | 8.00 | 111 | USD 65,000
Pashto | PAS_ASR002 | Conversational microphone data | 40 | 16.00 | 80 | USD 75,000
Pashto | PAS_BRC001 | Broadcast Data | | 0.00 | 51 | USD 22,500
Polish | POL_ASR001 | Microphone, Scripted Speech | 99 | 16.00 | 25 | EUR 3,600
Portuguese (Brazilian) | PTB_ASR001 | Microphone, Scripted Speech | 102 | 16.00 | 26 | EUR 3,600
Portuguese (Brazilian) | PTB_ASR002 | Telephony (cell and fixed), Conversational Speech | 200 | 8.00 | 66 | USD 35,000
Portuguese (European) | PTP_ASR001 | Telephony (cell and fixed), Conversational Speech | 200 | 8.00 | 72 | USD 30,000
Romanian | ROM_ASR001 | Telephony (cell and fixed), Conversational Speech | 200 | 8.00 | 74 | USD 30,000
Russian | RUS_ASR001 | Telephony (cell and fixed), Conversational Speech | 200 | 8.00 | 74 | USD 30,000
Russian | RUS_ASR002 | Microphone, Scripted Speech | 115 | 16.00 | 31 | EUR 3,600
Somali | SOM_ASR001 | Telephony (cell and fixed), Conversational Speech | 1000 | 8.00 | 101 | USD 65,000
Sorani (Kurdish) | SOR_ASR001 | Telephony (cell and fixed), Conversational Speech | 170 | 8.00 | 11 | USD 30,000
Spanish (European) | ESP_ASR001 | Microphone, Scripted Speech | 200 | 22.05 | 159 | USD 12,500
Spanish (European) | ESP_ASR002 | Voicemail Telephony, Spontaneous Speech | 512 | 8.00 | 97 | USD 37,500
Spanish (European) | ESP_TTS001 | Microphone, Scripted Speech | 1 | 22.05 | 1 | USD 6,000
Spanish (Latin America) | ESL_ASR001 | Microphone, Scripted Speech | 100 | 16.00 | 17 | EUR 3,600
Swedish | SWE_ASR001 | Microphone, Scripted Speech | 98 | 16.00 | 30 | EUR 3,600
Thai | THA_ASR001 | Microphone, Scripted Speech | 98 | 16.00 | 35 | EUR 3,600
Turkish | TUR_ASR001 | Telephony (cell and fixed), Conversational Speech | 200 | 8.00 | 83 | USD 30,000
Turkish | TUR_ASR002 | Microphone, Scripted Speech | 100 | 16.00 | 17 | EUR 3,600
Urdu | URD_ASR001 | Telephony (cell and fixed), Conversational Speech | 1000 | 8.00 | 95 | USD 45,000
Vietnamese | VIE_ASR001 | Microphone, Scripted Speech | 129 | 16.00 | 47 | EUR 3,600
Databases - Detailed
Language Arabic
DB Name CGA_ASR001
DB type 1 ASR
DB type 2 Microphone
Environments Home/office
Speakers 150
Prompts per speaker 280
Total utterances/Entries 42,000
Audio Hours 345
Sampling rate - kHz 16.00
Recording channels 4
List Price USD 20,000
Brief Description
• This is a 150 speaker microphone recorded database
Language Materials
• Each script elicits approximately 30 minutes of recorded speech
• Each Script includes:
o 30 Person names (first name and family name) from a set of 150
o 10 single isolated digits 0-9
o 10 8-digit sequences (randomly generated)
o 200 Phonetically balanced sentences
o 30 10-word phonetically balanced word strings
Demographics
• 50% of speakers are from the United Arab Emirates
• 50% of speakers are from Saudi Arabia
Transcriptions
• Complete transcriptions of the content of the speech files at a word level
• All acoustic events have been tagged using conventions derived from the SpeechDAT model
• All transcriptions fully vowelized
Contact Appen for further information
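The figures in the CGA_ASR001 entry are internally consistent, and the arithmetic is easy to verify. The following sketch sums the script items listed under Language Materials and multiplies by the number of speakers. The dictionary keys are shorthand labels chosen here for illustration.

```python
# Prompt counts per script, copied from the CGA_ASR001 entry above.
script_items = {
    "person names": 30,
    "isolated digits": 10,
    "8-digit sequences": 10,
    "phonetically balanced sentences": 200,
    "10-word phonetically balanced word strings": 30,
}
speakers = 150

# Each speaker reads one full script.
prompts_per_speaker = sum(script_items.values())
total_utterances = speakers * prompts_per_speaker

print(prompts_per_speaker)  # 280
print(total_utterances)     # 42000
```

Both results match the "Prompts per speaker" and "Total utterances/Entries" fields of the entry.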
Language Arabic (Eastern Algerian)
DB Name EAR_ASR001
DB type 1 ASR
DB type 2 Conversational Telephony
Environments Home/office
Speakers 496
Prompts per speaker
Total utterances/Entries
Audio Hours 58
Sampling rate - kHz 8.00
Recording channels 2
List Price USD 57,500
Brief Description
• This is a 496 speaker conversational** telephony database
• Approximately 29 hours of conversation data (equivalent to 58 hours of single channel audio)
• Broad distribution of age, gender and dialects (Algiers and Constantine)
• Database is fully transcribed and timestamped
• Database is accompanied by a pronunciation lexicon containing all transcribed words
** For the majority of calls, both speakers (in-line/out-line) were collected and transcribed. For a smaller number of calls, only one half of the conversation was collected and transcribed.
Contact Appen for further information
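For the two-channel conversational databases, the "Audio Hours" field counts each side of a call separately, so it is roughly twice the conversation time. A minimal sketch of that relationship follows, using figures quoted in the conversational telephony entries of this catalogue. The function name is hypothetical, and the one-hour tolerance is an assumption made here because the catalogue gives rounded, approximate values.

```python
def single_channel_hours(conversation_hours: float, channels: int = 2) -> float:
    """Convert conversation time to the equivalent single-channel audio time."""
    return conversation_hours * channels


# (DB name, approximate conversation hours, listed audio hours)
entries = [
    ("EAR_ASR001", 29, 58),
    ("ENA_ASR001", 28, 56),
    ("BAH_ASR001", 31, 63),
    ("BEN_ASR001", 47, 94),
]

for name, conversation, listed in entries:
    estimate = single_channel_hours(conversation)
    # Catalogue figures are rounded, so allow one hour of slack.
    consistent = abs(estimate - listed) <= 1
    print(name, estimate, consistent)
```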
Language Arabic English
DB Name ENA_ASR001
DB type 1 ASR
DB type 2 Conversational Telephony
Environments Low background noise
Unique speakers 250
Average call length 10-15 minutes
Total utterances/Entries N/A
Audio Hours 56
Sampling rate - kHz 8.00
Recording channels 2
List Price USD 35,000
Brief Description
• 115 telephony conversations are recorded for this project
• Demographic information is as follows:
o Roughly equal distribution of male and female
o Broad range of ages from 18 years – 55 years
o Approximately 50% landline/50% mobile
o Speakers speak on a range of generic topics
o Roughly equal distribution of Levantine Arabic and Egyptian Arabic speakers
• Approximately 28 hours of conversation data (equivalent to 56 hours of single channel audio)
• Database is fully transcribed and timestamped
• Database is accompanied by a pronunciation lexicon containing all transcribed words
Contact Appen for further information
Language Arabic (MSA)
DB Name MSA_ASR001
DB type 1 ASR
DB type 2 Microphone
Environments Low background noise
Speakers 78
Prompts per speaker
Total utterances/Entries 4,908
Audio Hours 12
Sampling rate - kHz 16.00
Recording channels 1
List Price EUR 3,600
Brief Description
• This is a 78 speaker microphone recorded database, Global Phone, developed in collaboration with the Karlsruhe Institute of Technology (KIT)
• Each speaker reads a number of phonetically rich sentences
• The read texts were selected from national newspaper articles available from the web to cover a wide domain with large vocabulary
• Gender 50% male, 50% female
• Broad distribution of speakers in the age group of 18-70 years
• All speakers were recorded in a home/office type environment
• Database is fully transcribed and the transcription is available both in original script and in Romanized form (where applicable)
Contact Appen for further information
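The sampling rate and audio hours in an entry give a rough idea of how much storage the audio will need. The sketch below estimates this for MSA_ASR001, assuming uncompressed 16-bit linear PCM. The catalogue does not state the sample format, so the `bytes_per_sample` default is an assumption, and the function name is hypothetical.

```python
def pcm_size_bytes(audio_hours: float, sampling_rate_khz: float,
                   bytes_per_sample: int = 2) -> int:
    """Estimate raw audio size; bytes_per_sample=2 assumes 16-bit PCM."""
    samples = audio_hours * 3600 * sampling_rate_khz * 1000
    return int(samples * bytes_per_sample)


# MSA_ASR001: 12 audio hours at 16.00 kHz, one recording channel.
size = pcm_size_bytes(12, 16.00)
print(f"{size / 1e9:.2f} GB")  # 1.38 GB
```

File headers, transcriptions and lexica add to this figure, and a compressed or 8-bit telephony encoding would reduce it.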
Language Bahasa Indonesia
DB Name BAH_ASR001
DB type 1 ASR
DB type 2 Conversational Telephony
Environments Low background noise
Speakers 1,002
Prompts per speaker
Total utterances/Entries
Audio Hours 63
Sampling rate - kHz 8.00
Recording channels 2
List Price USD 45,000
Brief Description
• This is a 1,002 speaker conversational** telephony database
• Approximately 31 hours of conversation data (equivalent to 63 hours of single channel audio)
• Database is fully transcribed and timestamped
• Database is accompanied by a pronunciation lexicon containing all transcribed words
** For a large proportion of calls, only one half of the conversation was collected and transcribed
Contact Appen for further information
Language Bengali
DB Name BEN_ASR001
DB type 1 ASR
DB type 2 Conversational Telephony
Environments Low background noise
Speakers 1,000
Prompts per speaker
Total utterances/Entries
Audio Hours 94
Sampling rate - kHz 8.00
Recording channels 2
List Price USD 45,000
Brief Description
• This is a 1,000 speaker conversational telephony database
• Approximately 47 hours of conversation data (equivalent to 94 hours of single channel audio)
• Database is fully transcribed and timestamped
• Database is accompanied by a pronunciation lexicon containing all transcribed words
Contact Appen for further information
Language Bulgarian
DB Name BUL_ASR001
DB type 1 ASR
DB type 2 Conversational Telephony
Environments Home/office
Speakers 217
Prompts per speaker
Total utterances/Entries
Audio Hours 77
Sampling rate - kHz 8.00
Recording channels 2
List Price EUR 3,600
Brief Description
• This is a 200 speaker conversational telephony database
• 200 telephony conversations are recorded for this project - 100 speakers make 2 calls each (1 from landline, 1 from mobile), to a pool of 100 call receivers
• Approximately 38 hours of conversation data (equivalent to 77 hours of single channel audio)
• Database is fully transcribed and timestamped
• Database is accompanied by a pronunciation lexicon containing all transcribed words
Contact Appen for further information
Language Bulgarian
DB Name BUL_ASR002
DB type 1 ASR
DB type 2 Microphone
Environments Low background noise
Speakers 77
Prompts per speaker
Total utterances/Entries 8,674
Audio Hours 22
Sampling rate - kHz 16.00
Recording channels 1
List Price EUR 3,600
Brief Description
• This is a 77 speaker microphone recorded database, Global Phone, developed in collaboration with the Karlsruhe Institute of Technology (KIT)
• Each speaker reads a number of phonetically rich sentences
• The read texts were selected from national newspaper articles available from the web to cover a wide domain with large vocabulary
• Gender 50% male, 50% female
• Broad distribution of speakers in the age group of 18-70 years
• All speakers were recorded in a home/office type environment
• Database is fully transcribed and the transcription is available both in original script and in Romanized form (where applicable)
Contact Appen for further information
Language Croatian
DB Name CRO_ASR001
DB type 1 ASR
DB type 2 Conversational Telephony
Environments Home/office
Speakers 200
Prompts per speaker
Total utterances/Entries
Audio Hours 79
Sampling rate - kHz 8.00
Recording channels 2
List Price EUR 3,600
Brief Description
• This is a 200 speaker conversational telephony database
• 200 telephony conversations are recorded for this project – 100 speakers make 2 calls each (1 from landline, 1 from mobile), to a pool of 100 call receivers
• Approximately 39 hours of conversation data (equivalent to 79 hours of single channel audio)
• Database is fully transcribed and timestamped
• Database is accompanied by a pronunciation lexicon containing all transcribed words
Contact Appen for further information
Language Croatian
DB Name CRO_ASR002
DB type 1 ASR
DB type 2 Microphone
Environments Home/office
Speakers 94
Prompts per speaker
Audio Hours 11
Total utterances/Entries 4,499
Sampling rate - kHz 16.00
Recording channels 1
List Price EUR 3,600
Brief Description
• This is a 94 speaker microphone recorded database, Global Phone, developed in collaboration with the Karlsruhe Institute of Technology (KIT)
• Each speaker reads a number of phonetically rich sentences
• The read texts were selected from national newspaper articles available from the web to cover a wide domain with large vocabulary
• Gender 50% male, 50% female
• Broad distribution of speakers in the age group of 18-70 years
• All speakers were recorded in a home/office type environment
• Database is fully transcribed and the transcription is available both in original script and in Romanized form (where applicable)
Contact Appen for further information
Language Czech
DB Name CZE_ASR001
DB type 1 ASR
DB type 2 Microphone
Environments Low background noise
Speakers 102
Prompts per speaker
Total utterances/Entries 12,425
Audio Hours 31
Sampling rate - kHz 16.00
Recording channels 1
List Price EUR 3,600
Brief Description
• This is a 102 speaker microphone recorded database, Global Phone, developed in collaboration with the Karlsruhe Institute of Technology (KIT)
• Each speaker reads a number of phonetically rich sentences
• The read texts were selected from national newspaper articles available from the web to cover a wide domain with large vocabulary
• Gender 50% male, 50% female
• Broad distribution of speakers in the age group of 18-70 years
• All speakers were recorded in a home/office type environment
• Database is fully transcribed and the transcription is available both in original script and in Romanized form (where applicable)
Contact Appen for further information
Language Dari
DB Name DAR_ASR001
DB type 1 ASR
DB type 2 Conversational Telephony
Environments Low background noise
Speakers 500
Prompts per speaker
Total utterances/Entries
Audio Hours 80
Sampling rate - kHz 8.00
Recording channels 2
List Price USD 45,000
Brief Description
• This is a 500 speaker conversational telephony database
• Approximately 40 hours of conversation data (equivalent to 80 hours of single channel audio)
• Database is fully transcribed and timestamped
• Database is accompanied by a pronunciation lexicon containing all transcribed words
• Telephony Distribution
o Landline 13%
o Mobile 87%
Contact Appen for further information
Language Dari
DB Name DAR_BRC001
DB type 1 Broadcast
DB type 2 Broadcast Data
Environments Broadcast Data
Speakers
Prompts per speaker
Total utterances/Entries
Audio Hours 40
Sampling rate - kHz 0.00
Recording channels 1
List Price USD 22,500
Brief Description
• Database contains 40 hours of Dari broadcast data
• Database is largely speech only and does not include music or advertisements
• Data types include:
o Talk shows
o Interviews
o News broadcasts (excluding news reading by anchors)
• Database is fully transcribed and timestamped
Contact Appen for further information
Language Dutch (Netherlands)
DB Name NLD_ASR001
DB type 1 ASR
DB type 2 Conversational Telephony
Environments Low background noise
Speakers 200
Prompts per speaker
Total utterances/Entries
Audio Hours 73
Sampling rate - kHz 8.00
Recording channels 2
List Price USD 30,000
Brief Description
• This is a 200 speaker conversational telephony database
• 200 telephony conversations are recorded for this project - 100 speakers make 2 calls each (1 from landline, 1 from mobile), to a pool of 100 call receivers
• Approximately 36 hours of conversation data (equivalent to 73 hours of single channel audio)
• Database is fully transcribed and timestamped
• Database is accompanied by a pronunciation lexicon containing all transcribed words
Contact Appen for further information
Language English (Australian)
DB Name AUS_ASR001
DB type 1 ASR
DB type 2 Telephony
Environments Home/office
Speakers 500
Prompts per speaker 165
Total utterances/Entries 82,500
Audio Hours 94
Sampling rate - kHz 8.00
Recording channels 1
List Price USD 20,000
Brief Description
• This is a 500 speaker telephony database
• 500 speakers, including some migrant representation: Asian (predominantly Chinese), Middle Eastern (predominantly Lebanese) and New Zealand accented English
• 165 prompts (read speech) per speaker, including:
o Digits
o Natural Numbers
o Letter strings
o Personal, place, and business names
o Confirmation items (yes, no + fuzzy)
o Generic Command and Control items (from a set of 215)
o Phonetically rich Sentences and Words
• Mobile 50%, fixed line 50%
• Age and Gender balanced
• Moderately quiet environments (home/office)
• Total audio length: 94 hours
• Fully transcribed to SpeechDAT type conventions
• Database is accompanied by a pronunciation lexicon containing all transcribed words
Contact Appen for further information
Language English (Australian)
DB Name AUS_ASR002
DB type 1 ASR
DB type 2 Telephony
Environments Mixed
Speakers 1,000
Prompts per speaker 75
Total utterances/Entries 75,000
Audio Hours 120
Sampling rate - kHz 8.00
Recording channels 1
List Price USD 31,500
Brief Description
• This is a 1,000 speaker Australian English database
• 75 prompts per speaker, including:
o Digits
o Natural Numbers
o Letter strings
o Personal, place, and business names
o Confirmation items (yes, no + fuzzy)
o Generic Command and Control items
o Phonetically rich Sentences and Words
• The prompts are a mixture of 'read' and 'elicited' items. 5 prompts per script are 'spontaneous free speech'
• Mixture of mobile and landline
• Age and Gender balanced
• Total audio length: 120 hours
• Fully transcribed to SpeechDAT type conventions
• Database is accompanied by a pronunciation lexicon containing all transcribed words
Contact Appen for further information
Language English (Canadian)
DB Name ENC_ASR001
DB type 1 ASR
DB type 2 Telephony
Environments Mixed
Speakers 1,000
Prompts per speaker 99
Total utterances/Entries 99,000
Audio Hours 144
Sampling rate - kHz 8.00
Recording channels 1
List Price USD 37,500
Brief Description
• This is an extended SALA II database.
• 49 prompts per speaker are as specified by the SALA II consortium. An additional 50 prompts (similar content) were recorded by each speaker.
• 99 prompts per speaker, including:
o Digits
o Natural Numbers
o Letter strings
o Personal, place, and business names
o Confirmation items (yes, no + fuzzy)
o Generic Command and Control items
o Phonetically rich Sentences and Words
• Mobile telephony recorded in a range of environments including in-car, home/office, roadside and other public places
• Fully transcribed to SALA II/SpeechDAT type conventions
• Database is accompanied by a pronunciation lexicon containing all transcribed words
Contact Appen for further information
Language English (Indian)
DB Name ENI_ASR001
DB type 1 ASR
DB type 2 Telephony
Environments Mixed
Speakers 2,358
Prompts per speaker 50
Total utterances/Entries 117,900
Audio Hours 225
Sampling rate - kHz 8.00
Recording channels 1
List Price USD 45,000
Brief Description
• This is a 2,358 speaker Indian English mobile telephony speech database recorded on location in India
• Database Type
o Medium Level background noise - in-car, home/office, roadside and other public place type environments
o Total audio length - Approximately 225 hours
• Demographics
o 2,358 speakers recorded in India
o 50% male, 50% female
o Broad distribution of age groups (16-60 years) and dialects
• Language Materials
o 50 prompts per speaker, including Digits; Natural Numbers; Personal, Place, and Business Names; Confirmation items (yes, no + fuzzy); Generic Command and Control items and Phonetically rich Sentences and Words
• Transcription and Lexicon
o Fully transcribed to SpeechDAT type conventions.
o Database is accompanied by a pronunciation lexicon [SAMPA] containing all transcribed words.
o Lexicon - 10,128 unique headwords
Contact Appen for further information
Language Indian English
DB Name ENI_ASR002
DB type 1 ASR
DB type 2 Conversational Telephony
Environments Low background noise
Unique speakers 540
Average call length 10-15 minutes
Total utterances/Entries N/A
Audio Hours 135
Sampling rate - kHz 8.00
Recording channels 2
List Price USD 28,000
Brief Description
• 271 telephony conversations are recorded for this project
• Demographic information is as follows:
o Roughly equal distribution of male and female
o Broad range of ages from 16 years – 60 years
o Dialect distribution:
Eastern India 10%
Northern India 35%
Pakistan 15%
Southern India 20%
Western India 19%
o Approximately 50% landline/50% mobile
o Speakers speak on a range of generic topics
• Approximately 67 hours of conversation data (equivalent to 135 hours of single channel
audio).
• Database is fully transcribed and timestamped
• Database is accompanied by a pronunciation lexicon containing all transcribed words
Contact Appen for further information
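The "Audio Hours" figure for two-channel conversational databases counts each side of a call separately, so it is roughly twice the conversation time. A small Python sketch of that relationship, using a hypothetical helper name and assuming both channels run for the full length of each call:

```python
def single_channel_hours(conversation_hours: float, channels: int = 2) -> float:
    """Convert conversation time to the equivalent single-channel audio time."""
    return conversation_hours * channels


# ENI_ASR002: approximately 67 hours of conversation recorded on 2 channels.
print(single_channel_hours(67))  # 134, listed as 135 because the 67 is rounded
```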
Language English (UK)
DB Name UKE_ASR001
DB type 1 ASR
DB type 2 Conversational Telephony
Environments Low background noise
Speakers 1,150
Prompts per speaker
Total utterances/Entries
Audio Hours 102
Sampling rate - kHz 8.00
Recording channels 2
List Price USD 45,000
Brief Description
• This is a 1,150 speaker conversational telephony database
• Provides good coverage of key accents across the UK and Ireland
• Approximately 51 hours of conversation data (equivalent to 102 hours of single channel
audio).
• Database is fully transcribed and timestamped
• Database is accompanied by a pronunciation lexicon containing all transcribed words
• Note: additional data available - please contact Appen for more details
Contact Appen for further information
Language English (UK)
DB Name UKE_ASR002
DB type 1 ASR
DB type 2 Voicemail Telephony
Environments Low background noise
Speakers 592
Prompts per speaker
Total utterances/Entries
Audio Hours 69
Sampling rate - kHz 8.00
Recording channels 1
List Price USD 37,500
Brief Description
• This is a 592 speaker voicemail telephony database
• Broad distribution of age, gender and landline/mobile coverage
• Provides good representation of key accents across the United Kingdom
• Approximately 69 audio hours of voicemail data
• The database covers speakers providing spontaneous voicemail type responses selected
from a pool of approx. 200 common voicemail scenarios (e.g. Leave a message to tell your
colleague that you are running late for a meeting)
• The database is fully transcribed and is accompanied by a pronunciation lexicon
containing all transcribed words
Contact Appen for further information
Language English (US)
DB Name USE_ASR001
DB type 1 ASR
DB type 2 Studio/microphone recordings
Environments Studio
Speakers 200
Prompts per speaker 400
Total utterances/Entries 80,000
Audio Hours 124
Sampling rate - kHz 48.00
Recording channels 2
List Price USD 15,000
Brief Description
• This is a 200 speaker microphone recorded database
• Each speaker read 400 prompts including:
o Digits
o Natural Numbers
o Personal and City names
o Telephone Numbers
o Generic Command and Control items
o Phonetically rich Sentences and Words
• All speakers were recorded in a studio type environment in USA
• Database is fully transcribed and is accompanied by a pronunciation lexicon containing all
transcribed words
Contact Appen for further information
Language English (US)
DB Name USE_ASR002
DB type 1 ASR
DB type 2 Conversational Telephony
Environments Low background noise
Speakers 20
Prompts per speaker
Total utterances/Entries
Audio Hours 14
Sampling rate - kHz 8.00
Recording channels 2
List Price USD 7,500
Brief Description
• This is a 20 speaker conversational telephony database
• Call-Centre style conversations
• Approximately 7 hours of conversation data in total
• Database is fully transcribed and timestamped
• Database is accompanied by a pronunciation lexicon containing all transcribed words
Contact Appen for further information
Language Farsi/Persian
DB Name FAR_ASR001
DB type 1 ASR
DB type 2 Telephony
Environments Mixed
Speakers 789
Prompts per speaker 48
Audio Hours 85
Total utterances/Entries 38,400
Sampling rate - kHz 8.00
Recording channels 1
List Price USD 45,000
Brief Description
• This is a 789 speaker Farsi telephony speech database recorded on location in Iran.
• 50% male, 50% female
• Broad distribution of age groups (16-60 years) and dialects
• Medium Level background noise - in-vehicle, home/office, roadside and other public place
type environments
• Language Materials
o 48 prompts per speaker, including Digits; Natural Numbers; Letter strings;
Personal, Place, and Business names; Confirmation items (yes and no); Generic
Command and Control items and Phonetically Rich sentences and words
• Transcriptions
o Fully transcribed to OrienTel type conventions
• Lexicon
o Database is accompanied by a pronunciation lexicon [SAMPA] containing all
transcribed words
• Total audio length - Approximately 85 hours
Contact Appen for further information
Language Farsi/Persian
DB Name FAR_ASR002
DB type 1 ASR
DB type 2 Conversational Telephony
Environments Mixed
Speakers 1,000
Prompts per speaker
Total utterances/Entries
Audio Hours 61
Sampling rate - kHz 8.00
Recording channels 2
List Price USD 57,500
Brief Description
• This is a 1,000 speaker conversational telephony database
• Approximately 30 hours of conversation data (equivalent to 61 hours of single channel
audio)
• Database is fully transcribed and time stamped
• Database is accompanied by a pronunciation lexicon containing all transcribed words
Contact Appen for further information
Language Filipino English
DB Name ENF_ASR001
DB type 1 ASR
DB type 2 Conversational Telephony
Environments Low background noise
Unique speakers 450
Average call length 10-15 minutes
Total utterances/Entries N/A
Audio Hours 107
Sampling rate - kHz 8.00
Recording channels 2
List Price USD 35,000
Brief Description
• 216 telephony conversations are recorded for this project
• Demographic information is as follows:
o Roughly equal distribution of male and female
o Broad range of ages from 18 years – 70 years
o Approximately 50% landline/50% mobile
o Speakers speak on a range of generic topics
• Approximately 53 hours of conversation data (equivalent to 107 hours of single channel
audio).
• Database is fully transcribed and time stamped
• Database is accompanied by a pronunciation lexicon containing all transcribed words
Contact Appen for further information
Language French (Canadian)
DB Name FRC_ASR001
DB type 1 ASR
DB type 2 Telephony
Environments Mixed
Speakers 1,000
Prompts per speaker 100
Total utterances/Entries 100,000
Audio Hours 131
Sampling rate - kHz 8.00
Recording channels 1
List Price USD 37,500
Brief Description
• This is an extended SALA II database
• 48 prompts per speaker are as specified by the SALA II consortium. An additional 52
prompts (similar content) were recorded by each speaker
• 100 prompts per speaker, including:
o Digits
o Natural Numbers
o Letter strings
o Personal, place, and business names
o Confirmation items (yes, no + fuzzy)
o Generic Command and Control items
o Phonetically rich Sentences and Words
• Mobile telephony recorded in a range of environments including in-car, home/office,
roadside and other public places
• Total audio length: 131 hours
• Fully transcribed to SpeechDAT type conventions
• Database is accompanied by a pronunciation lexicon [SAMPA] containing all
transcribed words
Contact Appen for further information
Language French (Canadian)
DB Name FRC_ASR002
DB type 1 ASR
DB type 2 Microphone recordings
Environments Home/office
Speakers 120
Prompts per speaker 150
Total utterances/Entries 22,500
Audio Hours 46
Sampling rate - kHz 16.00
Recording channels 1
List Price USD 22,500
Brief Description
• This is a 120 speaker microphone recorded database
• Scripts include:
o Person names
o Digits
o Digit strings (randomly generated)
o Addresses
o Phonetically rich sentences
• Dialects
o 50% Quebecois – Montreal
o 50% Quebecois – Other
• Database is fully transcribed and timestamped
• Database is accompanied by a pronunciation lexicon containing all transcribed words
Contact Appen for further information
Language French (Canadian)
DB Name FRC_ASR003
DB type 1 ASR
DB type 2 Conversational Telephony
Environments Low background noise
Speakers 251
Prompts per speaker
Total utterances/Entries
Audio Hours 20
Sampling rate - kHz 8.00
Recording channels 2
List Price USD 31,500
Brief Description
• This is a 251 speaker conversational** telephony database
• Approximately 10 hours of conversation data (equivalent to 20 hours of single channel
audio)
• Database is fully transcribed and time stamped
• Database is accompanied by a pronunciation lexicon containing all transcribed words
** For the majority of calls, both speakers (in-line/out-line) were collected and transcribed. For a
small number of calls, only one half of the conversation was collected and transcribed
Contact Appen for further information
Language French (European)
DB Name FRF_ASR001
DB type 1 ASR
DB type 2 Conversational Telephony
Environments Low background noise
Speakers 563
Prompts per speaker
Total utterances/Entries
Audio Hours 50
Sampling rate - kHz 8.00
Recording channels 2
List Price USD 31,500
Brief Description
• This is a 563 speaker conversational** telephony database
• Approximately 25 hours of conversation data (equivalent to 50 hours of single channel
audio).
• Database is fully transcribed and time stamped
• Database is accompanied by a pronunciation lexicon containing all transcribed words
** For the majority of calls, both speakers (in-line/out-line) were collected and transcribed. For a
smaller number of calls, only one half of the conversation was collected and transcribed
Contact Appen for further information
Language French (European)
DB Name FRF_ASR002
DB type 1 ASR
DB type 2 Voicemail Telephony
Environments Low background noise
Speakers 560
Prompts per speaker
Total utterances/Entries
Audio Hours 95
Sampling rate - kHz 8.00
Recording channels 1
List Price USD 37,500
Brief Description
• This is a 560 speaker voicemail telephony database
• Broad distribution of age, gender and landline/mobile coverage
• Provides good representation of key accents across France
• Approximately 47 audio hours of voicemail data
• The database covers speakers providing spontaneous voicemail type responses selected
from a pool of approx. 200 common voicemail scenarios (e.g. Leave a message to tell your
colleague that you are running late for a meeting)
• The database is fully transcribed and is accompanied by a pronunciation lexicon
containing all transcribed words
Contact Appen for further information
Language French (European)
DB Name FRF_ASR003
DB type 1 ASR
DB type 2 Microphone
Environments Home/office
Speakers 98
Prompts per speaker
Total utterances/Entries 10,273
Audio Hours 26
Sampling rate - kHz 16.00
Recording channels 1
List Price EUR 3,600
Brief Description
• This is a 98 speaker microphone recorded database, Global Phone, developed in
collaboration with the Karlsruhe Institute of Technology (KIT)
• Each speaker reads a number of phonetically rich sentences
• The read texts were selected from national newspaper articles available from the web to
cover a wide domain with large vocabulary
• Gender 50% male, 50% female
• Broad distribution of speakers in the age group of 18-70 years
• All speakers were recorded in a home/office type environment
• Database is fully transcribed and the transcription is available both in original script and in
Romanized form (where applicable)
Contact Appen for further information
Language German
DB Name DEU_ASR001
DB type 1 ASR
DB type 2 Microphone
Environments Studio
Speakers 127
Prompts per speaker 100
Total utterances/Entries 12,700
Audio Hours 33
Sampling rate - kHz 16.00
Recording channels 2
List Price USD 11,500
Brief Description
• This is a 127 speaker microphone recorded database
• Each speaker read 100 prompts including:
o Digits
o Natural Numbers
o Personal and City names
o Telephone Numbers
o Generic Command and Control items
o Phonetically rich Sentences and Words
• All speakers were recorded in a studio type environment in Germany
• Database is fully transcribed and is accompanied by a pronunciation lexicon containing all
transcribed words
Contact Appen for further information
Language German
DB Name DEU_ASR002
DB type 1 ASR
DB type 2 Voicemail Telephony
Environments Low background noise
Speakers 890
Prompts per speaker
Total utterances/Entries
Audio Hours 65
Sampling rate - kHz 8.00
Recording channels
List Price USD 37,500
Brief Description
• This is an 890 speaker voicemail telephony database
• Broad distribution of age, gender and landline/mobile coverage
• Provides good representation of key accents across Germany
• Approximately 65 audio hours of voicemail data
• The database covers speakers providing spontaneous voicemail type responses selected
from a pool of approx. 50 common voicemail scenarios (e.g. Leave a message to tell your
colleague that you are running late for a meeting)
• The database is fully transcribed and is accompanied by a pronunciation lexicon
containing all transcribed words
Contact Appen for further information
Language German
DB Name DEU_ASR003
DB type 1 ASR
DB type 2 Microphone
Environments Home/office
Speakers 77
Prompts per speaker
Total utterances/Entries 10,085
Audio Hours 25
Sampling rate - kHz 16.00
Recording channels 1
List Price EUR 3,600
Brief Description
• This is a 77 speaker microphone recorded database, Global Phone, developed in
collaboration with the Karlsruhe Institute of Technology (KIT).
• Each speaker reads a number of phonetically rich sentences
• The read texts were selected from national newspaper articles available from the web to
cover a wide domain with large vocabulary
• Gender 50% male, 50% female
• Broad distribution of speakers in the age group of 18-70 years
• All speakers were recorded in a home/office type environment
• Database is fully transcribed and the transcription is available both in original script and in
Romanized form (where applicable)
Contact Appen for further information
Language Hausa
DB Name HAU_ASR001
DB type 1 ASR
DB type 2 Microphone
Environments Home/office
Speakers 103
Prompts per speaker
Total utterances/Entries 7,895
Audio Hours 20
Sampling rate - kHz 16.00
Recording channels 1
List Price EUR 3,600
Brief Description
• This is a 103 speaker microphone recorded database, Global Phone, developed in
collaboration with the Karlsruhe Institute of Technology (KIT)
• Each speaker reads a number of phonetically rich sentences
• The read texts were selected from national newspaper articles available from the web to
cover a wide domain with large vocabulary
• Gender 50% male, 50% female
• Broad distribution of speakers in the age group of 18-70 years
• All speakers were recorded in a home/office type environment
• Database is fully transcribed and the transcription is available both in original script and in
Romanized form (where applicable)
Contact Appen for further information
Language Hausa
DB Name HAU_ASR002
DB type 1 ASR
DB type 2 Conversational telephony
Environments Low background noise
Speakers 200
Prompts per speaker
Total utterances/Entries
Audio Hours 66
Sampling rate - kHz 8.00
Recording channels 2
List Price USD 40,000
Brief Description
• This is a 200 speaker conversational telephony database
• 200 telephony conversations are recorded for this project - 100 speakers make 2 calls
each, to a pool of 100 call receivers
• Approximately 33 hours of conversation data (equivalent to 66 hours of single
channel audio).
• Database is fully transcribed and timestamped
• Database is accompanied by a pronunciation lexicon containing all transcribed words
Contact Appen for further information
Language Hebrew
DB Name HEB_ASR001
DB type 1 ASR
DB type 2 Conversational Telephony
Environments Low background noise
Speakers 200
Prompts per speaker
Total utterances/Entries
Audio Hours 69
Sampling rate - kHz 8.00
Recording channels 2
List Price USD 30,000
Brief Description
• This is a 200 speaker conversational telephony database
• 200 telephony conversations are recorded for this project - 100 speakers make 2 calls
each (1 from landline, 1 from mobile), to a pool of 100 call receivers
• Approximately 34 hours of conversation data (equivalent to 69 hours of single
channel audio).
• Database is fully transcribed and time stamped
• Database is accompanied by a pronunciation lexicon containing all transcribed words
Contact Appen for further information
Language Hindi
DB Name HIN_ASR001
DB type 1 ASR
DB type 2 Telephony
Environments Low background noise
Speakers 1,920
Prompts per speaker 50
Total utterances/Entries 96,000
Audio Hours 224
Sampling rate - kHz 8.00
Recording channels 1
List Price USD 45,000
Brief Description
• This is a 1,920 speaker Hindi mobile telephony speech database. The database comprises
1,920 speakers who speak Hindi as a second language (i.e. native speakers of Telugu,
Gujarati, etc. who use Hindi as a second language) recorded on location in India
• Database Type
o 1,920 speakers recorded in India
o 50% male, 50% female
o Broad distribution of age groups (16-60 years) and dialects
o Medium Level background noise - in-car, home/office, roadside and other public
place type environments
• Language Materials
o 50 prompts per speaker, including Digits; Natural Numbers; Personal, Place and
Business names; Confirmation items (yes, no + fuzzy); Generic Command and
Control items; Phonetically rich Sentences and Words; and Web addresses
• Transcriptions
o Fully transcribed to SpeechDAT type conventions
• Lexicon
o Database is accompanied by a pronunciation lexicon [SAMPA] containing all
transcribed words
o Lexicon - 9,853 unique headwords
o Total audio length - Approximately 224 hours
Contact Appen for further information
Language Hindi
DB Name HIN_ASR002
DB type 1 ASR
DB type 2 Conversational Telephony
Environments Mixed
Speakers 996
Prompts per speaker
Total utterances/Entries
Audio Hours 65
Sampling rate - kHz 8.00
Recording channels 2
List Price USD 45,000
Brief Description
• This is a 996 speaker conversational** telephony database
• Approximately 65 hours of conversation data (equivalent to 60 hours of single
channel audio)
• Database is fully transcribed and timestamped
• Database is accompanied by a pronunciation lexicon containing all transcribed words
** For the majority of calls, both speakers (in-line/out-line) were collected and transcribed.
For a smaller number of calls, only one half of the conversation was collected and
transcribed
Contact Appen for further information
Language Italian
DB Name ITA_ASR001
DB type 1 ASR
DB type 2 Microphone
Environments Mixed
Speakers 200
Prompts per speaker 200
Total utterances/Entries 40,000
Audio Hours 177
Sampling rate - kHz 22.05
Recording channels 4
List Price USD 12,500
Brief Description
• This is a 200 speaker microphone recorded database
• Each speaker read 200 utterances:
o 100 - command and control type items
o 100 - phonetically rich sentences
• Fully transcribed to SpeechDAT type conventions
• Database is accompanied by a pronunciation lexicon containing all transcribed words
• Lexicon - 7,316 unique headwords
• Total audio length - 177 hours
Contact Appen for further information
Language Italian
DB Name ITA_ASR002
DB type 1 ASR
DB type 2 Microphone
Environments In-Car
Speakers 103
Prompts per speaker 350
Audio Hours 189
Total utterances/Entries 35,875
Sampling rate - kHz 48.00
Recording channels 4
List Price USD 19,500
Brief Description
• This is a 205 session In-Car database
• Each speaker recorded 1 or 2 sessions:
o Session 1 in a parked vehicle with the engine running
o Session 2 in a vehicle travelling at 60 mph (100 km/h).
• 350 prompts were read by each speaker (175 per session) including:
o Digits
o Street names
o Generic Command and Control items
o Phonetically rich Sentences and Words
• Fully transcribed to SpeechDAT type conventions
• Database is accompanied by a pronunciation lexicon containing all transcribed words
Contact Appen for further information
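Because some speakers recorded only one session, the utterance total for this database tracks the session count rather than the speaker count. A brief Python sketch of that check, using a hypothetical helper name and assuming every session covers the full 175 prompts:

```python
def session_utterances(sessions: int, prompts_per_session: int) -> int:
    """Nominal utterance count when every session covers every prompt."""
    return sessions * prompts_per_session


# ITA_ASR002 figures from the listing: 205 sessions, 175 prompts per session.
print(session_utterances(205, 175))  # 35875
```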
Language Italian
DB Name ITA_ASR003
DB type 1 ASR
DB type 2 Conversational Telephony
Environments Low background noise
Speakers 200
Prompts per speaker
Total utterances/Entries
Audio Hours 72
Sampling rate - kHz 8.00
Recording channels 2
List Price USD 30,000
Brief Description
• This is a 200 speaker conversational telephony database
• 200 telephony conversations are recorded for this project - 100 speakers make 2 calls
each (1 from landline, 1 from mobile), to a pool of 100 call receivers
• Approximately 36 hours of conversation data (equivalent to 72 hours of single channel
audio)
• Database is fully transcribed and timestamped
• Database is accompanied by a pronunciation lexicon containing all transcribed words
Contact Appen for further information
Language Italian
DB Name ITA_ASR004
DB type 1 ASR
DB type 2 Voicemail Telephony
Environments Low background noise
Speakers 550
Prompts per speaker
Total utterances/Entries
Audio Hours 123
Sampling rate - kHz 8.00
Recording channels 1
List Price USD 37,500
Brief Description
• This is a 550 speaker voicemail telephony database
• Broad distribution of age, gender and landline/mobile coverage
• Provides good representation of key accents across Italy
• Approximately 123 audio hours of voicemail data
• The database covers speakers providing spontaneous voicemail type responses selected
from a pool of approx. 200 common voicemail scenarios (e.g. Leave a message to tell your
colleague that you are running late for a meeting)
• The database is fully transcribed and is accompanied by a pronunciation lexicon
containing all transcribed words
Contact Appen for further information
Language Italian
DB Name ITA_TTS001
DB type 1 TTS
DB type 2 Microphone
Environments Studio
Speakers 1
Prompts per speaker 3,300
Total utterances/Entries 3,300
Audio Hours 3
Sampling rate - kHz 22.05
Recording channels 1
List Price USD 11,500
Brief Description
• This is a single speaker TTS speech database. The database comprises 3,300 phonetically
rich sentences recorded by a male Italian speaker in a studio environment. The database is
accompanied by a pronunciation lexicon containing an entry for each of the words spoken
in the database
Contact Appen for further information
Language Japanese
DB Name JPN_ASR001
DB type 1 ASR
DB type 2 Microphone
Environments Home/office
Speakers 144
Prompts per speaker
Total utterances/Entries 13,067
Audio Hours 33
Sampling rate - kHz 16.00
Recording channels 1
List Price EUR 3,600
Brief Description
• This is a 144 speaker microphone recorded database, Global Phone, developed in
collaboration with the Karlsruhe Institute of Technology (KIT)
• Each speaker reads a number of phonetically rich sentences
• The read texts were selected from national newspaper articles available from the web to
cover a wide domain with large vocabulary
• Gender 50% male, 50% female
• Broad distribution of speakers in the age group of 18-70 years
• All speakers were recorded in a home/office type environment
• Database is fully transcribed and the transcription is available both in original script and in
Romanized form (where applicable)
Contact Appen for further information
Language Kannada
DB Name KAN_ASR001
DB type 1 ASR
DB type 2 Conversational Telephony
Environments Mixed
Speakers 1,000
Prompts per speaker
Total utterances/Entries
Audio Hours 30
Sampling rate - kHz 8.00
Recording channels 2
List Price USD 45,000
Brief Description
• This is a 1,000 speaker conversational telephony database
• Approximately 30 hours of conversation data (equivalent to 60 hours of single channel
audio).
• Database is fully transcribed and time stamped
• Database is accompanied by a pronunciation lexicon containing all transcribed words
Contact Appen for further information
Language Korean
DB Name KOR_ASR001
DB type 1 ASR
DB type 2 Microphone
Environments Home/office
Speakers 100
Prompts per speaker
Total utterances/Entries 8,107
Audio Hours 20
Sampling rate - kHz 16.00
Recording channels 1
List Price EUR 3,600
Brief Description
• This is a 100 speaker microphone recorded database, Global Phone, developed in
collaboration with the Karlsruhe Institute of Technology (KIT)
• Each speaker reads a number of phonetically rich sentences
• The read texts were selected from national newspaper articles available from the web to
cover a wide domain with large vocabulary
• Gender 50% male, 50% female
• Broad distribution of speakers in the age group of 18-70 years
• All speakers were recorded in a home/office type environment
• Database is fully transcribed and the transcription is available both in original script and in
Romanized form (where applicable)
Contact Appen for further information
Language Mandarin
DB Name MAC_ASR001
DB type 1 ASR
DB type 2 Telephony
Environments Mixed
Speakers 2,000
Prompts per speaker 100
Total utterances/Entries 200,000
Audio Hours 115
Sampling rate - kHz 8.00
Recording channels 1
List Price USD 45,000
Brief Description
• This is a 2,000 speaker Mandarin mobile telephony speech data collection
• The database comprises 2,000 Mandarin speakers recorded on location in China
• 2,000 speakers recorded in China
• 50% male, 50% female
• 100% Mobile Telephony
• Broad distribution of age groups (16-60 years)
• Language Materials
• 100 prompts per speaker, including:
o Digits
o Natural Numbers
o Personal, place, and business names
o Confirmation items (yes, no + fuzzy)
o Generic Command and Control items
o Phonetically rich Sentences and Words
• Transcriptions
• Fully transcribed to SpeechDAT type conventions
• Lexicon
• Database is accompanied by a pronunciation lexicon [SAMPA] containing all transcribed
words
Contact Appen for further information
Language Mandarin
DB Name MAC_ASR002
DB type 1 ASR
DB type 2 Microphone
Environments Home/office
Speakers 132
Prompts per speaker
Total utterances/Entries 10,225
Audio Hours 26
Sampling rate - kHz 16.00
Recording channels 1
List Price EUR 3,600
Brief Description
• This is a 132 speaker microphone recorded database, Global Phone, developed in
collaboration with the Karlsruhe Institute of Technology (KIT)
• Each speaker reads a number of phonetically rich sentences
• The read texts were selected from national newspaper articles available from the web to
cover a wide domain with large vocabulary
• Gender 50% male, 50% female
• Broad distribution of speakers in the age group of 18-70 years
• All speakers were recorded in a home/office type environment
• Database is fully transcribed and the transcription is available both in original script and in
Romanized form (where applicable)
Contact Appen for further information
Language Marathi
DB Name MAR_ASR001
DB type 1 ASR
DB type 2 Conversational Telephony
Environments Mixed
Speakers 1,000
Prompts per speaker
Total utterances/Entries
Audio Hours 30
Sampling rate - kHz 8.00
Recording channels 2
List Price USD 45,000
Brief Description
• This is a 1,000 speaker conversational telephony database
• Approximately 15 hours of conversation data (equivalent to 30 hours of single channel
audio).
• Database is fully transcribed and time stamped
• Database is accompanied by a pronunciation lexicon containing all transcribed words
Contact Appen for further information
Language Pashto
DB Name PAS_ASR001
DB type 1 ASR
DB type 2 Conversational Telephony
Environments Low background noise
Speakers 967
Prompts per speaker
Total utterances/Entries
Audio Hours 111
Sampling rate - kHz 8.00
Recording channels 1
List Price USD 65,000
Brief Description
• This is a 967 speaker conversational** telephony database
• Approximately 55 hours of conversation data (equivalent to 111 hours of single channel
audio).
• Database is fully transcribed and time stamped
• Database is accompanied by a pronunciation lexicon containing all transcribed words
** For the majority of calls, both speakers (in-line/out-line) were collected and transcribed. For
a smaller number of calls, only one half of the conversation was collected and transcribed
Contact Appen for further information
Language Pashto
DB Name PAS_ASR002
DB type 1 ASR
DB type 2 Conversational microphone data
Environments Low background noise
Number of sessions 40
Average session length 120 minutes
Total utterances/Entries N/A
Audio Hours 80
Sampling rate - kHz 16.00
Recording channels 2
List Price USD 75,000
Brief Description
• Each recording consists of a number of TransTAC style dialogues (monolingual 2-way
conversations). One speaker acts as an interviewer and the other as the interviewee
• The interviewer appears in more than one set of dialogues but the interviewee is unique for
each set
• Data collection scenarios are similar to TransTAC style (e.g. civil affairs, checkpoints etc.)
• Demographic information is as follows:
o Roughly 25% female and 75% male speakers
o Broad range of ages from 18 years – 55 years
o Broad distribution across two dialect regions in Afghanistan
• 40 hours of conversation data (equivalent to 80 hours of single channel audio)
• Database is fully transcribed and time stamped
• Database is accompanied by a pronunciation lexicon containing all transcribed words
• A full translation of the transcripts into French is also available as an optional additional
purchase
Contact Appen for further information
Language Pashto
DB Name PAS_BRC001
DB type 1 Broadcast
DB type 2 Broadcast Data
Environments Broadcast Data
Speakers
Prompts per speaker
Total utterances/Entries
Audio Hours 51
Sampling rate - kHz 0.00
Recording channels 1
List Price USD 22,500
Brief Description
• Database contains 50 hours of Pashto broadcast data
• Database is largely speech only and does not include music or advertisements
• Data types include:
o Talk shows
o Interviews
o News broadcasts (excluding news reading by anchors)
• Database is fully transcribed and timestamped
Contact Appen for further information
Language Polish
DB Name POL_ASR001
DB type 1 ASR
DB type 2 Microphone
Environments Home/office
Speakers 99
Prompts per speaker
Total utterances/Entries 10,130
Audio Hours 25
Sampling rate - kHz 16.00
Recording channels 1
List Price EUR 3,600
Brief Description
• This is a 99 speaker microphone recorded database, Global Phone, developed in
collaboration with the Karlsruhe Institute of Technology (KIT)
• Each speaker reads a number of phonetically rich sentences
• The read texts were selected from national newspaper articles available from the web to
cover a wide domain with large vocabulary
• Gender 50% male, 50% female
• Broad distribution of speakers in the age group of 18-70 years
• All speakers were recorded in a home/office type environment
• Database is fully transcribed and the transcription is available both in original script and in
Romanized form (where applicable)
Contact Appen for further information
Language Portuguese (Brazilian)
DB Name PTB_ASR001
DB type 1 ASR
DB type 2 Microphone
Environments Home/office
Speakers 102
Prompts per speaker
Total utterances/Entries 10,417
Audio Hours 26
Sampling rate - kHz 16.00
Recording channels 1
List Price EUR 3,600
Brief Description
• This is a 102 speaker microphone recorded database, Global Phone, developed in
collaboration with the Karlsruhe Institute of Technology (KIT)
• Each speaker reads a number of phonetically rich sentences
• The read texts were selected from national newspaper articles available from the web to
cover a wide domain with large vocabulary
• Gender 50% male, 50% female
• Broad distribution of speakers in the age group of 18-70 years
• All speakers were recorded in a home/office type environment
• Database is fully transcribed and the transcription is available both in original script and in
Romanized form (where applicable)
Contact Appen for further information
Language Portuguese (Brazilian)
DB Name PTB_ASR002
DB type 1 ASR
DB type 2 Conversational Telephony
Environments Low background noise
Speakers 200
Prompts per speaker
Total utterances/Entries
Audio Hours 66
Sampling rate - kHz 8.00
Recording channels 2
List Price USD 35,000
Brief Description
• This is a 200 speaker conversational telephony database (some speakers participated in
up to 2 calls)
• Approximately 33 hours of conversation data (equivalent to 66 hours of single channel
audio).
• Database is fully transcribed and time stamped
• Database is accompanied by a pronunciation lexicon containing all transcribed words
Contact Appen for further information
Language Portuguese (European)
DB Name PTP_ASR001
DB type 1 ASR
DB type 2 Conversational Telephony
Environments Low background noise
Speakers 200
Prompts per speaker
Total utterances/Entries
Audio Hours 72
Sampling rate - kHz 8.00
Recording channels 2
List Price USD 30,000
Brief Description
• This is a 200 speaker conversational telephony database
• 200 telephony conversations are recorded for this project - 100 speakers make 2 calls
each (1 from landline, 1 from mobile), to a pool of 100 call receivers
• Approximately 36 hours of conversation data (equivalent to 72 hours of single channel
audio).
• Database is fully transcribed and time stamped
• Database is accompanied by a pronunciation lexicon containing all transcribed words
Contact Appen for further information
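The figures quoted in conversational telephony entries such as this one are related by simple arithmetic, sketched below in Python. The function names are hypothetical, and the sketch assumes that every call contributes both recording channels, which does not hold for databases where only one half of some conversations was collected.

```python
def single_channel_hours(conversation_hours: float, channels: int = 2) -> float:
    """Each hour of two-party conversation yields one hour of audio per channel."""
    return conversation_hours * channels


def average_call_minutes(conversation_hours: float, calls: int) -> float:
    """Mean call duration in minutes, assuming the hours are spread over all calls."""
    return conversation_hours * 60 / calls


# Figures taken from the PTP_ASR001 entry above.
callers = 100
calls_per_caller = 2            # one landline call, one mobile call
calls = callers * calls_per_caller
conversation_hours = 36         # "approximately 36 hours" in the entry

print(calls)
print(single_channel_hours(conversation_hours))
print(average_call_minutes(conversation_hours, calls))
```

Because the catalog quotes approximate hours, results derived this way are estimates only.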
Language Romanian
DB Name ROM_ASR001
DB type 1 ASR
DB type 2 Conversational Telephony
Environments Low background noise
Speakers 200
Prompts per speaker
Total utterances/Entries
Audio Hours 74
Sampling rate - kHz 8.00
Recording channels 2
List Price USD 30,000
Brief Description
• This is a 200 speaker conversational telephony database
• 200 telephony conversations are recorded for this project – 100 speakers make 2 calls
each (1 from landline, 1 from mobile), to a pool of 100 call receivers
• Approximately 37 hours of conversation data (equivalent to 74 hours of single channel
audio)
• Database is fully transcribed and time stamped
• Database is accompanied by a pronunciation lexicon containing all transcribed words
Contact Appen for further information
Language Russian
DB Name RUS_ASR001
DB type 1 ASR
DB type 2 Conversational Telephony
Environments Low background noise
Speakers 200
Prompts per speaker
Total utterances/Entries
Audio Hours 74
Sampling rate - kHz 8.00
Recording channels 2
List Price USD 30,000
Brief Description
• This is a 200 speaker conversational telephony database
• 200 telephony conversations are recorded for this project - 100 speakers make 2 calls
each (1 from landline, 1 from mobile), to a pool of 100 call receivers
• Approximately 37 hours of conversation data (equivalent to 74 hours of single channel
audio).
• Database is fully transcribed and time stamped
• Database is accompanied by a pronunciation lexicon containing all transcribed words
Contact Appen for further information
Language Russian
DB Name RUS_ASR002
DB type 1 ASR
DB type 2 Microphone
Environments Home/office
Speakers 115
Prompts per speaker
Total utterances/Entries 12,205
Audio Hours 31
Sampling rate - kHz 16.00
Recording channels 1
List Price EUR 3,600
Brief Description
• This is a 115 speaker microphone recorded database, Global Phone, developed in
collaboration with the Karlsruhe Institute of Technology (KIT)
• Each speaker reads a number of phonetically rich sentences
• The read texts were selected from national newspaper articles available from the web to
cover a wide domain with large vocabulary
• Gender 50% male, 50% female
• Broad distribution of speakers in the age group of 18-70 years
• All speakers were recorded in a home/office type environment
• Database is fully transcribed and the transcription is available both in original script and in
Romanized form (where applicable)
Contact Appen for further information
Language Somali
DB Name SOM_ASR001
DB type 1 ASR
DB type 2 Conversational Telephony
Environments Low background noise
Speakers 1,000
Prompts per speaker
Total utterances/Entries
Audio Hours 101
Sampling rate - kHz 8.00
Recording channels 2
List Price USD 65,000
Brief Description
• This is a 1,000 speaker conversational telephony database
• Approximately 50 hours of conversation data (equivalent to 101 hours of single channel
audio)
• Database is fully transcribed and time stamped
• Database is accompanied by a pronunciation lexicon containing all transcribed words
Contact Appen for further information
Language Sorani (Kurdish)
DB Name SOR_ASR001
DB type 1 ASR
DB type 2 Conversational Telephony
Environments Low background noise
Speakers 170
Prompts per speaker
Total utterances/Entries
Audio Hours 11
Sampling rate - kHz 8.00
Recording channels 2
List Price USD 30,000
Brief Description
• This is a 170 speaker conversational telephony database
• Approximately 5 hours of conversation data (equivalent to 11 hours of single channel
audio).
• Database is fully transcribed and time stamped
• Database is accompanied by a pronunciation lexicon containing all transcribed words
• For a large proportion of calls, only one half of the conversation was collected and
transcribed
Contact Appen for further information
Language Spanish (European)
DB Name ESP_ASR001
DB type 1 ASR
DB type 2 Microphone
Environments Mixed
Speakers 200
Prompts per speaker 200
Total utterances/Entries 40,000
Audio Hours 159
Sampling rate - kHz 22.05
Recording channels 4
List Price USD 12,500
Brief Description
• This is a 200 speaker microphone recorded database
• Each speaker read 200 utterances:
o 100 - command and control type items
o 100 - phonetically rich sentences
• Fully transcribed to SpeechDAT type conventions
• Database is accompanied by a pronunciation lexicon containing all transcribed words
• Lexicon - 6,367 unique headwords
• Total audio length - 159 hours
Contact Appen for further information
Language Spanish (European)
DB Name ESP_ASR002
DB type 1 ASR
DB type 2 Voicemail Telephony
Environments Low background noise
Speakers 512
Prompts per speaker
Total utterances/Entries
Audio Hours 97
Sampling rate - kHz 8.00
Recording channels 1
List Price USD 37,500
Brief Description
• This is a 512 speaker voicemail telephony database
• Broad distribution of age, gender and landline/mobile coverage
• Provides good representation of key accents across Spain
• Approximately 97 audio hours of voicemail data
• The database covers speakers providing spontaneous voicemail type responses selected
from a pool of approx. 200 common voicemail scenarios (e.g. Leave a message to tell your
colleague that you are running late for a meeting)
• The database is fully transcribed and is accompanied by a pronunciation lexicon
containing all transcribed words
Contact Appen for further information
Language Spanish (European)
DB Name ESP_TTS001
DB type 1 TTS
DB type 2 Microphone
Environments Studio
Speakers 1
Prompts per speaker 1,787
Total utterances/Entries 1,787
Audio Hours 1
Sampling rate - kHz 22.05
Recording channels 1
List Price USD 6,000
Brief Description
• This is a single speaker TTS speech database. The database comprises 1,787 phonetically
rich sentences recorded by a male Spanish speaker in a studio environment. The database
is accompanied by a pronunciation lexicon containing an entry for each of the words
spoken in the database
Contact Appen for further information
Language Spanish (Latin America)
DB Name ESL_ASR001
DB type 1 ASR
DB type 2 Microphone
Environments Home/office
Speakers 100
Prompts per speaker
Total utterances/Entries 6,898
Audio Hours 17
Sampling rate - kHz 16.00
Recording channels 1
List Price EUR 3,600
Brief Description
• This is a 100 speaker microphone recorded database, Global Phone, developed in
collaboration with the Karlsruhe Institute of Technology (KIT)
• Each speaker reads a number of phonetically rich sentences
• The read texts were selected from national newspaper articles available from the web to
cover a wide domain with large vocabulary
• Gender 50% male, 50% female
• Broad distribution of speakers in the age group of 18-70 years
• All speakers were recorded in a home/office type environment
• Database is fully transcribed and the transcription is available both in original script and in
Romanized form (where applicable)
Contact Appen for further information
Language Swedish
DB Name SWE_ASR001
DB type 1 ASR
DB type 2 Microphone
Environments Home/office
Speakers 98
Prompts per speaker
Total utterances/Entries 11,816
Audio Hours 30
Sampling rate - kHz 16.00
Recording channels 1
List Price EUR 3,600
Brief Description
• This is a 98 speaker microphone recorded database, Global Phone, developed in
collaboration with the Karlsruhe Institute of Technology (KIT)
• Each speaker reads a number of phonetically rich sentences
• The read texts were selected from national newspaper articles available from the web to
cover a wide domain with large vocabulary
• Gender 50% male, 50% female
• Broad distribution of speakers in the age group of 18-70 years
• All speakers were recorded in a home/office type environment
• Database is fully transcribed and the transcription is available both in original script and in
Romanized form (where applicable)
Contact Appen for further information
Language Thai
DB Name THA_ASR001
DB type 1 ASR
DB type 2 Microphone
Environments Home/office
Speakers 98
Prompts per speaker
Total utterances/Entries 14,039
Audio Hours 35
Sampling rate - kHz 16.00
Recording channels 1
List Price EUR 3,600
Brief Description
• This is a 98 speaker microphone recorded database, Global Phone, developed in
collaboration with the Karlsruhe Institute of Technology (KIT)
• Each speaker reads a number of phonetically rich sentences
• The read texts were selected from national newspaper articles available from the web to
cover a wide domain with large vocabulary
• Gender 50% male, 50% female
• Broad distribution of speakers in the age group of 18-70 years
• All speakers were recorded in a home/office type environment
• Database is fully transcribed and the transcription is available both in original script and in
Romanized form (where applicable)
Contact Appen for further information
Language Turkish
DB Name TUR_ASR001
DB type 1 ASR
DB type 2 Conversational Telephony
Environments Low background noise
Speakers 200
Prompts per speaker
Total utterances/Entries
Audio Hours 83
Sampling rate - kHz 8.00
Recording channels 2
List Price USD 30,000
Brief Description
• This is a 200 speaker conversational telephony database
• 200 telephony conversations are recorded for this project - 100 speakers make 2 calls
each (1 from landline, 1 from mobile), to a pool of 100 call receivers
• Approximately 41 hours of conversation data (equivalent to 83 hours of single channel
audio).
• Database is fully transcribed and time stamped
• Database is accompanied by a pronunciation lexicon containing all transcribed words
Contact Appen for further information
Language Turkish
DB Name TUR_ASR002
DB type 1 ASR
DB type 2 Microphone
Environments Home/office
Speakers 100
Prompts per speaker
Total utterances/Entries 6,950
Audio Hours 17
Sampling rate - kHz 16.00
Recording channels 1
List Price EUR 3,600
Brief Description
• This is a 100 speaker microphone recorded database, Global Phone, developed in
collaboration with the Karlsruhe Institute of Technology (KIT)
• Each speaker reads a number of phonetically rich sentences
• The read texts were selected from national newspaper articles available from the web to
cover a wide domain with large vocabulary
• Gender 50% male, 50% female
• Broad distribution of speakers in the age group of 18-70 years
• All speakers were recorded in a home/office type environment
• Database is fully transcribed and the transcription is available both in original script and in
Romanized form (where applicable)
Contact Appen for further information
Language Urdu
DB Name URD_ASR001
DB type 1 ASR
DB type 2 Conversational Telephony
Environments Mixed
Speakers 1,000
Prompts per speaker
Total utterances/Entries
Audio Hours 95
Sampling rate - kHz 8.00
Recording channels 2
List Price USD 45,000
Brief Description
• This is a 1,000 speaker conversational telephony database recorded by native Urdu
speakers in Pakistan (700 speakers) and India (300 speakers)
• Approximately 47 hours of conversation data (equivalent to 95 hours of single channel
audio).
• Database is fully transcribed and time stamped
• Database is accompanied by a pronunciation lexicon containing all transcribed words
Contact Appen for further information
Language Vietnamese
DB Name VIE_ASR001
DB type 1 ASR
DB type 2 Microphone
Environments Home/office
Speakers 129
Prompts per speaker
Total utterances/Entries 18,842
Audio Hours 47
Sampling rate - kHz 16.00
Recording channels 1
List Price EUR 3,600
Brief Description
• This is a 129 speaker microphone recorded database, Global Phone, developed in
collaboration with the Karlsruhe Institute of Technology (KIT)
• Each speaker reads a number of phonetically rich sentences
• The read texts were selected from national newspaper articles available from the web to
cover a wide domain with large vocabulary
• Gender 50% male, 50% female
• Broad distribution of speakers in the age group of 18-70 years
• All speakers were recorded in a home/office type environment
• Database is fully transcribed and the transcription is available both in original script and in
Romanized form (where applicable)
Contact Appen for further information
Lexica

Overview
Appen has considerable experience in providing a variety of lexicon types. These include:
• Pronunciation Lexica providing phonemic representation, syllabification, and stress (primary and secondary as appropriate)
• Part-of-speech tagged Lexica providing grammatical and semantic labels
• Other reference text based materials including spelling/mis-spelling lists, spell-check dictionaries, mappings of colloquial language to standard forms, orthographic normalisation lists.
Over a period of 15 years, Appen has generated a significant volume of licensable material for a wide range of languages (please see language list below).

Domain Coverage
Typical domains covered in our off-the-shelf holdings for a given language include:
• General Vocabulary
• Geographical Names e.g. Place Names (City, State, Suburb)
• Numbers (0-10,000)
• Person Names (both Given and Family)
Lexica can be developed from a wordlist provided by the client or by Appen. If a client requires vocabulary of a specific nature or to cover a specific domain, this can typically be provided under the same license and pricing terms as our pre-existing (off-the-shelf) holdings.

Lexicon Structure
• Our Lexica are usually created using a SAMPA phone set for the language which aligns SAMPA symbols with IPA equivalents. We can convert to most other machine readable formats on request
• We also include documentation files which include phone set definitions, statistical notes about phone coverage within a given Lexicon, and may include background information on data quality and validation.
Lexica are typically delivered as text files consisting of three or four tab-delimited fields:
Field 1 - Headword
Field 2 - SAMPA pronunciation
Field 3 - Variant Rank (0 = preferred pronunciation; 1 = also heard, less common)
Field 4 - Label e.g. (FAMILY_NAME, GIVEN_NAME, COMMON_WORD…etc.)
In addition to the phonemic mark-up, our Lexica are marked up for primary and secondary stress and for syllabification where applicable. They will also include pronunciation variants where relevant.

Lexicon Category    Brief Description    License Price per headword (USD)
1    Most languages using Latin based orthographies    USD 0.335
2    Languages requiring tone mark-up (e.g. Mandarin, Cantonese) and languages requiring multiple representational forms in the orthography (e.g. Japanese)    USD 0.415
3    Languages requiring full diacritization/vowelization (e.g. Arabic)    USD 0.460
Pricing for specialized Languages and Part-of-Speech Tagged Lexica can be provided on request.
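The tab-delimited layout described under Lexicon Structure could be read with a few lines of Python. This is only a sketch: the class and function names are hypothetical, the sample lines are invented for illustration and are not taken from any Appen lexicon, and a delivered file may differ in encoding or field details.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class LexiconEntry:
    headword: str
    pronunciation: str            # SAMPA string, kept verbatim
    variant_rank: int             # 0 = preferred, 1 = also heard, less common
    label: Optional[str] = None   # e.g. FAMILY_NAME; absent in three-field files


def parse_lexicon_line(line: str) -> LexiconEntry:
    """Split one line into the three or four tab-delimited fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) not in (3, 4):
        raise ValueError(f"expected 3 or 4 tab-delimited fields, got {len(fields)}")
    label = fields[3] if len(fields) == 4 else None
    return LexiconEntry(fields[0], fields[1], int(fields[2]), label)


def preferred_pronunciations(entries: Iterable[LexiconEntry]) -> Dict[str, str]:
    """Keep only the rank-0 (preferred) pronunciation for each headword."""
    return {e.headword: e.pronunciation for e in entries if e.variant_rank == 0}


# Invented sample lines (hypothetical headwords and SAMPA strings).
SAMPLE_LINES = [
    'casa\t"ka.sa\t0\tCOMMON_WORD',
    'Garcia\tgar."Ti.a\t0\tFAMILY_NAME',
    'Garcia\tgar."si.a\t1\tFAMILY_NAME',
]

entries = [parse_lexicon_line(line) for line in SAMPLE_LINES]
print(preferred_pronunciations(entries))
```

Keeping the variant rank as an integer makes it easy to select the preferred pronunciation while still retaining the less common variants for recognition tasks.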
Number of headwords
New offerings are frequently added. For holdings information in a given language or to discuss any customized development efforts, please contact: sales@appen.com, appen.com

[Bar charts: number of headwords per language, horizontal axis from 0 to 50,000 in steps of 5,000. Bars longer than the axis carry a numeric label.]

Chart 1 languages: Arabic (Algerian), Arabic (Egyptian), Arabic (Gulf), Arabic (Iraqi), Arabic (Maghrebi), Arabic (MSA), Arabic (North Levantine), Arabic (Palestinian), Arabic (South Levantine), Arabic (Syrian), Arabic (UAE), Assamese, Bahasa Indonesia, Bahasa Malay, Basque, Bengali, Bulgarian, Cantonese, Catalan, Croatian, Czech, Danish, Dari, Dutch, English (Australian), English (Canadian)
Chart 1 off-scale bar labels: >75,000; >55,000; >100,000; >110,000; >75,000; >70,000

Chart 2 languages: English (Indian), English (New Zealand), English (UK), English (US), Finnish, French (Belgian), French (Canadian), French (European), French (Luxembourg), French (Switzerland), German, German (Austria), German (Switzerland), Greek, Hausa, Hebrew, Hindi, Hungarian, Italian, Japanese, Kannada, Korean, Malayalam, Mandarin, Marathi, Norwegian
Chart 2 off-scale bar labels: >155,000; >85,000; >60,000; >110,000; >55,000; >190,000; >260,000; >100,000; >115,000; >200,000

Chart 3 languages: Oriya, Pashto, Persian/Farsi, Polish, Portuguese (Brazil), Portuguese (EU), Romanian, Russian, Serbian, Somali, Sorani (Kurdish), Spanish (All Latin America), Spanish (American - US), Spanish (EU - Castilian), Spanish (Mexican), Swahili (Kenya), Swedish, Sylheti, Tagalog, Tamil, Telugu, Thai, Turkish, Ukrainian, Urdu, Vietnamese, Wu, Xiang
Chart 3 off-scale bar labels: >250,000; >100,000; >100,000; >90,000; >115,000; >100,000; >50,000; >100,000; >100,000
Other Language Resources
Apart from speech databases and lexica, Appen also has a range of other language resources
available for license, which can be found in this section. These resources include:
1. Text Corpora - We have a wide variety of text collections in different languages
available for license. Apart from the Vowelized Arabic Corpus, Appen also has a range of
Named Entity annotated texts. These are corpora of 500,000 words of news text that have
been annotated for persons, titles, quantities, geopolitical entities, locations, facilities, etc.
2. Morphological Analyzers - Our morphological analyzers are designed to generate
grammatically acceptable words using tagged stem dictionaries and information on
inflectional affixes and their combinations. They can manipulate text from languages with
non-Latin scripts and currently generate Urdu and Persian, including informal written
variants of affixes.
3. Thesaurus - Appen can undertake thesaurus development in several ways: from first
principles, as an extension to existing work or as validation of an existing thesaurus, with
consistency and coverage an important focus. Because each language is subtly different
and requires deep grammatical analysis to produce a quality product, native speakers are
always used to build a thesaurus. Appen can produce thesauri as a licensable database,
supplied in a standard XML format or to client specifications.
4. Language Analysis Documentation - Appen can provide comprehensive language
analysis documents under license for all languages of interest. These documents support
system and application developers and include phonological features and processes,
analysis of Romanization schemes (where applicable), regional and dialectal differences and
population statistics of speakers. Appen can also provide analysis and recommendations
on specific collections for a nominated language.
Language Analysis Documents
List Price USD 2,500 (per language)

Language DB Name
Arabic (Iraqi) ARB_LAN001
Arabic (North Levantine) ARB_LAN002
Bahasa Indonesia BAH_LAN001
Brazilian Portuguese PTB_LAN001
Croatian CRO_LAN001
Dari DAR_LAN001
English (US) ENG_LAN001
Farsi/Persian FAR_LAN001
French (Canadian) FRC_LAN001
German DEU_LAN001
Hebrew HEB_LAN001
Japanese JAP_LAN001
Korean KOR_LAN001
Mandarin MAC_LAN001
Pashto PAS_LAN001
Russian RUS_LAN001
Serbian SRB_LAN001
Sorani (Kurdish) SOR_LAN001
Thai THA_LAN001
Urdu URD_LAN001

Brief Description
The key topics that are typically covered in the language analysis document include:
• General Information about the country
• General Information about the language
• Language classification of the language
• Other Languages spoken in the country
• History of the language (where relevant) including changes due to immigration etc.
• Dialects of the language
o maps indicating dialect regions
o discussion of dialects: distribution, features etc.
o recommendations on a dialect distribution that would be feasible to use in a speech data collection
• Sound System of the language
• Relevant Phonological Processes prevalent in the language/country
• Orthographic Conventions for the language
• Communications
NER Corpora
Words 500,000 (per language)
List Price USD 7,500 (per language)

Language DB Name
Arabic ARB_NER001
English ENG_NER001
Farsi/Persian FAR_NER001
Japanese JPY_NER001
Korean KOR_NER001
Mandarin MAC_NER001
Russian RUS_NER001
Urdu URD_NER001

Brief Description
Corpora containing text material collected from a variety of sources. Each Text Corpus contains approximately 500,000 words and is tagged for the following Named Entities:
- Person
- Organization
- Location
- Nationality
- Religion
- Facility
- Geo-Political Entity
- Titles
Text Corpora
Language Arabic (MSA)
DB Name ARB_THE001
DB type 2 Thesaurus
Words 28,000
List Price Provided on request
Brief Description:
• The thesaurus contains 28,000 headwords
• For each headword, the following information is provided:
o Detailed Part-Of-Speech information including Verb (Intransitive/Transitive), Adverb, Noun, Adjective
o A broad definition in English
o Synonyms
o Antonyms
o A broad definition of the antonym group linked to the sense group
Text Corpora
Language Arabic (MSA)
DB Name ARB_TXT001
DB type 2 Vowelized text corpus
Words 450,000
List Price USD 9,500
Brief Description:
• This vowelised corpus is made up of 450,000 words of Arabic news text
• The text has been 100% manually vowelised and checked
Text Corpora
Language Farsi/Persian
DB Name FAR_MOR001
DB type 2 Morphological Database
Words 0
List Price USD 32,500
Brief Description:
• The Farsi/Persian morphological database comprises six files in text format:
- a stems dictionary;
- a dictionary of inflectional prefixes;
- a dictionary of inflectional suffixes; and
- three compatibility tables, which define the grammatically acceptable combinations of stems, prefixes and suffixes for any given stem in the stems dictionary (prefix-suffix; prefix-stem; suffix-stem).
• The format of the six files corresponds to the input format required by the Buckwalter AraGen generation program. This program uses the input file to output the complete set of potential words defined by the stem and affix dictionaries and compatibility tables
• All words and affixes in the six files are in a Romanized form (converted using an Appen conversion table). Each word and affix is shown with and without short vowels. The form with short vowels (the vowelized form) reflects the pronunciation of the word or affix
• SUMMARY OF CONTENTS
- Stems in stem dictionary - 18,364 (including stem alternations)
- Stems in stem dictionary - 16,492 (excluding stem alternations)
- Number of suffixes: 506 (including zero suffix and variants of suffixes with and without the zero width non-joiner character)
- Number of prefixes: 14 (including zero prefix)
- Number of unique words generated: 1,608,559
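The way three affix dictionaries and three compatibility tables combine to define a word list can be sketched in Python as follows. This is a simplified illustration of the general idea, not the AraGen program or its file format: the category names, the toy Romanized entries and the function name are all assumptions made for the example.

```python
from itertools import product

# Each entry is (surface form, morphological category). Toy data only,
# not taken from the database; the empty string stands for a zero affix.
PREFIXES = [("", "Pref-0"), ("mi", "Pref-IMPF")]
STEMS = [("ketab", "N"), ("rav", "V-PRES")]
SUFFIXES = [("", "Suff-0"), ("ha", "Suff-PL"), ("am", "Suff-1SG")]

# Compatibility tables: pairs of categories that may co-occur.
PREFIX_STEM = {("Pref-0", "N"), ("Pref-IMPF", "V-PRES")}
STEM_SUFFIX = {("N", "Suff-0"), ("N", "Suff-PL"), ("V-PRES", "Suff-1SG")}
PREFIX_SUFFIX = {("Pref-0", "Suff-0"), ("Pref-0", "Suff-PL"),
                 ("Pref-IMPF", "Suff-1SG")}


def generate_words(prefixes, stems, suffixes,
                   prefix_stem, stem_suffix, prefix_suffix):
    """Return every prefix+stem+suffix string licensed by all three tables."""
    words = set()
    for (p, p_cat), (s, s_cat), (x, x_cat) in product(prefixes, stems, suffixes):
        if ((p_cat, s_cat) in prefix_stem
                and (s_cat, x_cat) in stem_suffix
                and (p_cat, x_cat) in prefix_suffix):
            words.add(p + s + x)
    return sorted(words)


print(generate_words(PREFIXES, STEMS, SUFFIXES,
                     PREFIX_STEM, STEM_SUFFIX, PREFIX_SUFFIX))
```

Collecting results in a set mirrors the "unique words generated" count in the summary, since different affix combinations can produce the same surface string.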
Text Corpora
Language Urdu
DB Name URD_MOR001
DB type 2 Morphological Database
Words 0
List Price USD 32,500
Brief Description:
• The Urdu morphological database comprises six files in text format:
- a stems dictionary;
- a dictionary of inflectional prefixes;
- a dictionary of inflectional suffixes; and
- three compatibility tables, which define the grammatically acceptable combinations of stems, prefixes and suffixes for any given stem in the stems dictionary (prefix-suffix; prefix-stem; suffix-stem)
• The format of the six files corresponds to the input format required by the Buckwalter AraGen generation program. This program uses the input file to output the complete set of potential words defined by the stem and affix dictionaries and compatibility tables.
• All words and affixes in the six files are in a Romanized form (converted using an Appen conversion table). Each word and affix is shown with and without short vowels. The form with short vowels (the vowelized form) reflects the pronunciation of the word or affix.
• SUMMARY OF CONTENTS
- Stems in stem dictionary - 13,267 (including stem alternations)
- Stems in stem dictionary - 13,116 (excluding stem alternations)
- Number of suffixes: 115 (including zero suffix)
- Number of prefixes: 1 (zero prefix)
- Number of unique words generated: 31,109