Incorporating Dialectal Variability for Socially Equitable...

Post on 08-Apr-2018

Incorporating Dialectal Variability for Socially Equitable Language Identification

David Jurgens, Yulia Tsvetkov, and Dan Jurafsky

McNamee, P., “Language identification: a solved problem suitable for undergraduate instruction,” Journal of Computing Sciences in Colleges 20(3), 2005:

“This paper describes […] how even the most simple of these methods using data obtained from the World Wide Web achieve accuracy approaching 100% on a test suite comprised of ten European languages”

Whose language are we identifying?

Global platforms attract global diversity in a language

English, French, Spanish, Arabic

[Figure labels: 125M, 90M, 79M, 60M, and 251M speakers]

Current language detection methods perform significantly worse in less-developed countries

[Figure: estimated LID accuracy for English tweets (y-axis) against the Human Development Index of the text’s origin country (x-axis). The Human Development Index combines education, life expectancy, and income. Annotations: More Dialect and Less Dialect (Labov, 1964; Ash, 2002); a bracket marking 23%.]

Practical Motivation: Epidemic Detection

Keyword Filter: “flu”, “sick”

Language Detection: non-English?

NLP: Which symptoms?

Failing to recognize a language silences its speakers’ voices

Our goal is to make language ID performance equal for all languages across all dialects

This is a universal NLP issue!

Key Problems: Current methods struggle in the global setting because

Data: No corpora capture global variation in lexicon and dialect

Model: Models make simplistic assumptions about how multilinguals communicate

Our approach

NLP methodologies capable of handling linguistic variation

Better social representation through network-based sampling

Our Data Solution: Improve linguistic representation through network-based sampling

Bootstrap dialectal corpora using existing classifiers to find monolingual individuals

Sample from the geolocated Twitter social network to include text from people at all locations

[Figure: classifier labels on one individual’s tweets: eng, eng, eng, eng, eng, and one fra]
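The bootstrapping step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `find_monolingual_users`, the 0.9 agreement threshold, and the minimum tweet count are assumptions made for the example. It takes labels already produced by an existing classifier and keeps the users whose tweets are labeled overwhelmingly as one language.

```python
from collections import Counter

def find_monolingual_users(user_labels, min_share=0.9, min_tweets=5):
    """Return {user: dominant_language} for users whose tweets an existing
    classifier labels almost entirely as one language.

    user_labels: dict mapping a user id to the list of per-tweet labels.
    min_share and min_tweets are illustrative thresholds (assumptions).
    """
    monolingual = {}
    for user, labels in user_labels.items():
        if len(labels) < min_tweets:
            continue  # too little evidence about this user
        language, count = Counter(labels).most_common(1)[0]
        if count / len(labels) >= min_share:
            monolingual[user] = language
    return monolingual

labels = {
    "user_a": ["eng"] * 9 + ["fra"],                 # 90% eng: kept
    "user_b": ["eng", "fra", "eng", "fra", "spa"],   # mixed: skipped
    "user_c": ["fra", "fra"],                        # too few tweets: skipped
}
result = find_monolingual_users(labels)
```

Under these assumptions, a tweet from a kept user that the classifier labeled differently (the single fra for `user_a`) is a candidate for text the classifier handles poorly.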

Build strategically diverse corpora and synthesize code-switched examples

Topical

Social

Geographic

Multilingual
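One simple way to synthesize a code-switched example is to concatenate text from two monolingual sources and keep a language label for every token. The sketch below assumes that approach; the slides do not specify the procedure, and `make_code_switched` and the whitespace tokenization are hypothetical choices.

```python
def make_code_switched(first, second):
    """Join two (text, language) pairs into one example with per-token labels.

    Tokens are split on whitespace, an assumption made for simplicity.
    """
    tokens, labels = [], []
    for text, language in (first, second):
        for token in text.split():
            tokens.append(token)
            labels.append(language)
    return tokens, labels

tokens, labels = make_code_switched(
    ("Je vais commander à emporter.", "fra"),
    ("I'm too lazy to cook.", "eng"),
)
```

Each token keeps the label of the source it came from, so the synthetic example carries word-level supervision at no annotation cost.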

Our model solution: treat language identification as a character-based sequence-to-sequence task.

Encoder: encodes the whole sentence using its characters

Decoder: decodes each word’s language from the sentence encoding

Input: Je vais commander à emporter. I’m too lazy to cook. (read as characters: J e _ … o k .)

Output: Fra Fra Fra Fra Fra . Eng Eng Eng Eng Eng .

[Diagram legend: symbol for a multi-layer recurrent neural network]

Jaech et al. 2016; Samih et al. 2016
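To make the input and output concrete, the sketch below prepares one training pair in the format shown on the slide: a character sequence for the encoder and one label per token for the decoder. It covers data preparation only, not the recurrent network. The helper names, the use of `_` for spaces, and the regular-expression tokenizer are assumptions for illustration.

```python
import re

def to_char_sequence(sentence):
    """Encoder input: one symbol per character, with '_' standing for a space."""
    return ["_" if ch == " " else ch for ch in sentence]

def to_label_sequence(sentence, spans):
    """Decoder target: one label per word, punctuation copied through.

    spans: list of (number_of_words, language) in sentence order.
    """
    # Words are runs of letters with an optional apostrophe part;
    # any other non-space symbol is a punctuation token.
    tokens = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)?|[^\w\s]", sentence)
    word_labels = [lang for n, lang in spans for _ in range(n)]
    labels, i = [], 0
    for token in tokens:
        if re.match(r"[^\W\d_]", token):
            labels.append(word_labels[i])
            i += 1
        else:
            labels.append(token)
    return labels

sentence = "Je vais commander à emporter. I'm too lazy to cook."
chars = to_char_sequence(sentence)
labels = to_label_sequence(sentence, [(5, "Fra"), (5, "Eng")])
```

The encoder sees characters while the decoder emits one label per word, so the two sequences have different lengths, which is what makes a sequence-to-sequence formulation a natural fit.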

Equilid vs off-the-shelf

[Figure: Macro F1 (0 to 100) in three panels. 70 Languages on Twitter: langid.py, CLD2, Our Method. Geo-diverse Tweets: langid.py, CLD2, Our Method. Multilingual Tweets: Polyglot, CLD2, Our Method.]

Lui et al. 2013, 2014
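Macro F1 averages the per-language F1 scores with equal weight, so a rarely seen language counts as much as a common one. The sketch below is a plain-Python illustration of the metric, not the evaluation code behind the figure; the function name and toy labels are made up for the example, and averaging only over languages present in the gold labels is an assumption.

```python
def macro_f1(gold, predicted):
    """Unweighted mean of per-class F1 over the classes that appear in gold."""
    scores = []
    for cls in sorted(set(gold)):
        pairs = list(zip(gold, predicted))
        tp = sum(g == cls and p == cls for g, p in pairs)
        fp = sum(g != cls and p == cls for g, p in pairs)
        fn = sum(g == cls and p != cls for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)

gold = ["eng", "eng", "eng", "eng", "fra", "fra"]
pred = ["eng", "eng", "eng", "eng", "eng", "fra"]
score = macro_f1(gold, pred)
```

In this toy case five of six predictions are correct, yet the macro score is lower than that accuracy because the missed fra example halves the recall of the smaller class.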

Equilid even outperforms systems specifically tuned for each dataset

[Figure: Macro F1 (0 to 100), Our Method vs Jaech et al. (2016), in two panels. 70 Languages on Twitter: bar values 92 and 91.2. TweetLID: bar values 79.6 and 78.7.]

Case Study: Do our solutions provide socially equitable language identification for health-related queries?

1M Tweets with any of 385 English terms from established lexicons for influenza, psychological well-being, and social health

Lamb et al. (2013); Smith et al. (2016); Preotiuc-Pietro et al. (2015); Park et al. (2016)

Task: does the language identification system recognize every tweet as English?
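Because every tweet in this set is taken to be English, the task reduces to measuring, per country, the share of tweets a system labels English. The sketch below assumes a hypothetical `detect` callable that returns a language code; the toy detector, the sample texts, and the country names are placeholders, not the systems or data from the study.

```python
from collections import defaultdict

def english_rate_by_country(tweets, detect, english_code="en"):
    """tweets: iterable of (country, text).

    Returns {country: share of that country's tweets labeled English}.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for country, text in tweets:
        totals[country] += 1
        if detect(text) == english_code:
            hits[country] += 1
    return {country: hits[country] / totals[country] for country in totals}

# Toy stand-in for a real language identifier (assumption for the demo only).
def toy_detect(text):
    return "en" if "the" in text.lower().split() else "und"

sample = [
    ("A", "I caught the flu"),
    ("A", "the cough is back"),
    ("B", "the fever is gone"),
    ("B", "flu season again"),
]
rates = english_rate_by_country(sample, toy_detect)
```

Plotting these per-country rates against each country's Human Development Index would give the kind of comparison the case study describes.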

Equilid raises the bar for socially equitable language identification

[Figure: estimated accuracy for English tweets (y-axis) against the Human Development Index of the text’s origin country (x-axis)]

Social Equality doesn’t stop at Language Identification

Methodologies capable of handling language as it is used

Better social representation in our data

David Jurgens, Yulia Tsvetkov, and Dan Jurafsky

Be equitable! https://github.com/davidjurgens/equilid