Incorporating Dialectal Variability for Socially Equitable Language Identification

David Jurgens, Yulia Tsvetkov, and Dan Jurafsky

McNamee, P., “Language identification: a solved problem suitable for undergraduate instruction,” Journal of Computing Sciences in Colleges 20(3), 2005.

“This paper describes […] how even the most simple of these methods using data obtained from the World Wide Web achieve accuracy approaching 100% on a test suite comprised of ten European languages”

Whose language are we identifying?

Global platforms attract global diversity in a language

[Figure: speaker populations for English, French, Spanish, and Arabic; labels shown: 125M, 90M, 79M, 60M, and 251M speakers]

[Figure: estimated LID accuracy for English tweets against the Human Development Index (education, life expectancy, income) of the text's origin country, on an axis running from more dialect to less dialect; a brace on the plot marks 23% (Labov, 1964; Ash, 2002)]

Current language detection methods perform significantly worse in less-developed countries

Practical Motivation: Epidemic Detection

[Diagram: Keyword Filter (“flu”, “sick”) → Language Detection → NLP (Which symptoms?), with a “non-English” branch leaving Language Detection]

non-English?

Failing to recognize a language silences its speakers’ voices
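The pipeline on the slide can be sketched in a few lines to show where voices get lost. This is a minimal illustration, not the authors' system: `toy_detector` is a hypothetical stand-in for a language identifier that only knows a small "standard" vocabulary, and the tweets are invented.

```python
from typing import Callable, Iterable, List

KEYWORDS = ("flu", "sick")  # the two keywords shown on the slide


def keyword_filter(tweets: Iterable[str]) -> List[str]:
    """Keep tweets containing any health keyword (case-insensitive substring match)."""
    return [t for t in tweets if any(k in t.lower() for k in KEYWORDS)]


def language_filter(tweets: Iterable[str], detect: Callable[[str], str]) -> List[str]:
    """Keep only tweets the detector labels 'en'; all others never reach the NLP step."""
    return [t for t in tweets if detect(t) == "en"]


def toy_detector(text: str) -> str:
    """Hypothetical LID stand-in: 'en' only if at least half the words are in a tiny vocabulary."""
    known = {"i", "am", "have", "the", "flu", "sick", "feel", "today"}
    words = [w.strip(".,!?") for w in text.lower().split()]
    hits = sum(w in known for w in words)
    return "en" if hits / max(len(words), 1) >= 0.5 else "unknown"


tweets = [
    "I have the flu today",
    "dis flu got me sick fr fr",  # invented informal English; the toy detector rejects it
    "nice weather today",         # no keyword, removed by the first stage
]

matched = keyword_filter(tweets)
kept = language_filter(matched, toy_detector)
print(len(matched), len(kept))
```

Both health-related tweets pass the keyword stage, but only the one written in the detector's expected vocabulary survives language detection, so the second author's symptoms are never analyzed.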

Our goal is to make language ID performance equal for all languages across all dialects

This is a universal NLP issue!

Key Problems: Current methods struggle in the global setting because

Data: No corpus captures global variation in lexicon and dialect

Model: Makes simplistic assumptions about how multilinguals communicate

Our approach

NLP methodologies capable of handling linguistic variation

Better social representation through network-based sampling

Our Data Solution: Improve linguistic representation through network-based sampling

Bootstrap dialectal corpora using existing classifiers to find monolingual individuals

[Figure labels: eng / eng, eng / fra]
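One way to read the bootstrapping step is as an agreement filter over existing classifiers. The sketch below assumes a rule the slides do not spell out: a user counts as monolingual when every classifier agrees on every one of their posts and the agreed label never changes. The function name, the `min_posts` threshold, and the two toy detectors are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Optional, Sequence, Tuple

Detector = Callable[[str], str]


def monolingual_users(
    posts: Iterable[Tuple[str, str]],
    detectors: Sequence[Detector],
    min_posts: int = 2,
) -> Dict[str, str]:
    """Return {user: language} for users whose posts all get one unanimous label."""
    labels_by_user: Dict[str, List[Optional[str]]] = defaultdict(list)
    for user, text in posts:
        votes = {detect(text) for detect in detectors}
        # None marks a post on which the classifiers disagree (e.g. eng vs. fra).
        labels_by_user[user].append(votes.pop() if len(votes) == 1 else None)

    result: Dict[str, str] = {}
    for user, labels in labels_by_user.items():
        if len(labels) >= min_posts and None not in labels and len(set(labels)) == 1:
            result[user] = labels[0]
    return result


# Two toy detectors standing in for existing off-the-shelf classifiers (assumptions).
def det_a(text: str) -> str:
    return "fra" if "je" in text.lower().split() else "eng"


def det_b(text: str) -> str:
    return "fra" if {"je", "vais"} & set(text.lower().split()) else "eng"


posts = [
    ("u1", "I am sick"), ("u1", "too lazy to cook"),
    ("u2", "Je vais bien"), ("u2", "I am here"),      # labels change across posts
    ("u3", "vais commander"), ("u3", "je vais"),      # detectors disagree on one post
]
print(monolingual_users(posts, [det_a, det_b]))
```

Text from the selected users can then be added to the corpus under the agreed label, which is the sense in which existing classifiers bootstrap new data.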

Sample from the geolocated Twitter social network to include text from people at all locations


Build strategically diverse corpora and synthesize code-switched examples

Sources of diversity: Topical, Geographic, Social, Multilingual
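Synthesizing code-switched examples can be illustrated by joining two monolingual sentences and keeping a language label for every token. This is a sketch under an assumption: the slides do not say how the synthetic examples were built, so the concatenation strategy, the function name, and the whitespace tokenization here are illustrative choices.

```python
import random
from typing import List, Tuple

Labeled = Tuple[str, str]  # (sentence, language code)


def synthesize_code_switched(a: Labeled, b: Labeled, rng: random.Random) -> Tuple[List[str], List[str]]:
    """Join two monolingual sentences in random order and label each token with its source language."""
    parts = [a, b]
    rng.shuffle(parts)
    tokens: List[str] = []
    labels: List[str] = []
    for text, lang in parts:
        for tok in text.split():
            tokens.append(tok)
            labels.append(lang)
    return tokens, labels


rng = random.Random(0)  # seeded so that runs are repeatable
tokens, labels = synthesize_code_switched(
    ("Je vais commander à emporter.", "fra"),
    ("I’m too lazy to cook.", "eng"),
    rng,
)
for tok, lab in zip(tokens, labels):
    print(lab, tok)
```

Each synthetic example carries word-level supervision, which is what a per-word language decoder needs during training.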

Our model solution: treat language identification as a character-based sequence to sequence task.

[Diagram: Encoder → Decoder; annotation: represents a multi-layer recurrent neural network (Jaech et al. 2016; Samih et al. 2016)]

Input: Je vais commander à emporter. I’m too lazy to cook.

Encoder: encodes the whole sentence using its characters (J e _ … o k .)

Decoder: decodes each word’s language from the sentence encoding

Output: Fra Fra Fra Fra Fra . Eng Eng Eng Eng Eng .
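The sequence-to-sequence framing pairs a character sequence on the encoder side with one language label per word on the decoder side. The helper below only builds such a training pair and does not implement the network; using "_" as the space marker is an assumption based on the "J e _" fragment on the slide, and `to_seq2seq_pair` is a hypothetical name.

```python
from typing import List, Tuple


def to_seq2seq_pair(tokens: List[str], labels: List[str]) -> Tuple[List[str], List[str]]:
    """Build (source characters, target labels) for a character-to-label seq2seq model."""
    if len(tokens) != len(labels):
        raise ValueError("need exactly one label per token")
    # Source side: every character of the sentence, with spaces made explicit as "_".
    source = ["_" if ch == " " else ch for ch in " ".join(tokens)]
    # Target side: one language label per whitespace-separated word.
    target = list(labels)
    return source, target


source, target = to_seq2seq_pair(["Je", "vais"], ["Fra", "Fra"])
print(source)
print(target)
```

Because the source is characters rather than words, the model needs no fixed vocabulary, which helps with the spelling variation common in dialectal and informal text.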

Equilid vs off-the-shelf

[Figure: comparison of systems on three test sets: 70 Languages on Twitter, Geo-diverse Tweets, and Multilingual Tweets (Lui et al. 2013, 2014)]

Equilid even outperforms systems specifically tuned for each dataset

[Figure: results on datasets with tuned systems, including TweetLID; values shown in extraction order: 92, 91.2, 79.6, 78.7]

Case Study: Do our solutions provide socially-equitable language identification for health-related queries?

1M Tweets with any of 385 English terms from established lexicons for influenza, psychological well-being, and social health

Lamb et al. (2013); Smith et al. (2016); Preotiuc-Pietro et al. (2015); Park et al. (2016)

Task: does the language identification system recognize every tweet as English?
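Since every tweet in this case study is English by construction, the task reduces to measuring, per group, the fraction of tweets a system labels as English. A minimal sketch of that measurement follows; grouping by origin country is an assumption suggested by the earlier Human Development Index slide, and the detector and records are toy placeholders.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Tuple


def english_recall_by_group(
    records: Iterable[Tuple[str, str]],
    detect: Callable[[str], str],
) -> Dict[str, float]:
    """For (group, text) records that are all English, return the share labeled 'en' per group."""
    totals: Counter = Counter()
    hits: Counter = Counter()
    for group, text in records:
        totals[group] += 1
        hits[group] += int(detect(text) == "en")
    return {group: hits[group] / totals[group] for group in totals}


def toy_detect(text: str) -> str:
    """Hypothetical detector that only recognizes English when it sees the word 'the'."""
    return "en" if "the" in text.lower().split() else "und"


records = [
    ("A", "the flu is bad"), ("A", "the doctor said rest"),
    ("B", "the flu"), ("B", "flu bad bad"),
]
rates = english_recall_by_group(records, toy_detect)
gap = max(rates.values()) - min(rates.values())
print(rates, gap)
```

A socially equitable system would drive `gap` toward zero while keeping every group's rate high.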

Equilid raises the bar for socially-equitable language identification

Social Equality doesn’t stop at Language Identification

Methodologies capable of handling language as it is used

Better social representation in our data

David Jurgens, Yulia Tsvetkov, and Dan Jurafsky

Be equitable! https://github.com/davidjurgens/equilid
