Incorporating Dialectal Variability for Socially Equitable Language Identification

David Jurgens, Yulia Tsvetkov, and Dan Jurafsky

Page 1: Incorporating Dialectal Variability for Socially Equitable ...jurgens/docs/jurgens-tsvetkov-jurafsky... · Incorporating Dialectal Variability for Socially Equitable ... Dialect Less

Incorporating Dialectal Variability for Socially Equitable Language Identification

David Jurgens, Yulia Tsvetkov, and Dan Jurafsky

Page 3:

“This paper describes […] how even the most simple of these methods using data obtained from the World Wide Web achieve accuracy approaching 100% on a test suite comprised of ten European languages”

McNamee, P., “Language identification: a solved problem suitable for undergraduate instruction,” Journal of Computing Sciences in Colleges 20(3), 2005.

Page 4:

Whose language are we identifying?

Page 10:

Global platforms attract global diversity in a language

[Figure: English, annotated with speaker populations of 125M, 90M, 79M, 60M, and 251M; French, Spanish, and Arabic are shown as further examples]

Page 15:

Current language detection methods perform significantly worse in less-developed countries

[Figure: estimated LID accuracy for English tweets (y-axis) plotted against the Human Development Index of the text’s origin country (x-axis; HDI combines education, life expectancy, and income). The plot is annotated “More Dialect” and “Less Dialect” (Labov, 1964; Ash, 2002), and a brace marks a 23% difference.]

Page 21:

Practical Motivation: Epidemic Detection

[Diagram: pipeline of Keyword Filter (“flu”, “sick”), Language Detection, and NLP (Which symptoms?); label: “non-English?”]
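The pipeline on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: `toy_detect_language` is a hypothetical stand-in for an off-the-shelf language detector (it labels a tweet English only if enough of its words appear in a tiny standard-English vocabulary), and the tweets are invented. It shows how an English tweet with nonstandard spelling can pass the keyword filter and still be dropped before any symptom analysis.

```python
KEYWORDS = {"flu", "sick"}

# Hypothetical stand-in for a real detector: a tiny "standard English" vocabulary.
STANDARD_ENGLISH = {"i", "am", "so", "sick", "with", "the", "flu", "today", "have", "a"}


def keyword_filter(tweets):
    """Keep tweets that mention any tracked keyword."""
    return [t for t in tweets if KEYWORDS & set(t.lower().split())]


def toy_detect_language(tweet, threshold=0.6):
    """Label a tweet 'en' if enough of its words are in the toy vocabulary."""
    words = tweet.lower().split()
    known = sum(w in STANDARD_ENGLISH for w in words)
    return "en" if words and known / len(words) >= threshold else "und"


def pipeline(tweets):
    """Keyword filter -> language detection -> tweets passed on to NLP."""
    matched = keyword_filter(tweets)
    kept = [t for t in matched if toy_detect_language(t) == "en"]
    dropped = [t for t in matched if t not in kept]
    return kept, dropped


tweets = [
    "I am so sick with the flu today",  # invented, standard spelling
    "man dis flu got me sick af fr",    # invented, nonstandard spelling
    "nice weather today",               # no keyword, removed by the filter
]
kept, dropped = pipeline(tweets)
print("passed to NLP:", kept)
print("dropped as non-English:", dropped)
```

Both health-related tweets match the keyword filter, but only the first reaches the NLP stage; the second is discarded as non-English.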

Page 22:

Failing to recognize a language silences its speakers’ voices

Page 25:

Current language detection methods perform significantly worse in less-developed countries

[Figure: estimated accuracy for English tweets plotted against the Human Development Index of the text’s origin country, annotated “More Dialect” and “Less Dialect” (Labov, 1964; Ash, 2002)]

Our goal is to make language ID performance equal for all languages across all dialects

This is a universal NLP issue!

Page 28:

Key Problems: Current methods struggle in the global setting because

Data: no corpus captures global variation in lexicon and dialect

Model: current models make simplistic assumptions about how multilinguals communicate

Page 29:

Our approach

NLP methodologies capable of handling linguistic variation

Better social representation through network-based sampling

Page 37:

Our Data Solution: Improve linguistic representation through network-based sampling

Bootstrap dialectal corpora using existing classifiers to find monolingual individuals

Sample from the geolocated Twitter social network to include text from people at all locations

[Figure: social network whose nodes carry language labels, mostly eng with one fra]
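The bootstrapping step can be sketched roughly as below. This is an illustrative reading of the slide, not the authors' implementation: the function names, the `min_share` and `min_tweets` thresholds, and the toy data are all assumptions. The idea is that if an existing classifier labels nearly all of a user's tweets with one language, the user is treated as monolingual and every one of their tweets receives that label, including the few the classifier got wrong.

```python
from collections import Counter


def find_monolingual_users(labels_by_user, min_share=0.95, min_tweets=5):
    """Return {user: language} for users whose tweets are almost all one language.

    labels_by_user maps a user id to the language labels an existing
    classifier assigned to that user's tweets. Both thresholds are
    illustrative assumptions, not values taken from the talk.
    """
    monolingual = {}
    for user, labels in labels_by_user.items():
        if len(labels) < min_tweets:
            continue  # too little evidence about this user
        language, count = Counter(labels).most_common(1)[0]
        if count / len(labels) >= min_share:
            monolingual[user] = language
    return monolingual


def relabel_tweets(tweets_by_user, monolingual):
    """Label every tweet of a monolingual user with that user's language."""
    return [
        (text, monolingual[user])
        for user, texts in tweets_by_user.items()
        if user in monolingual
        for text in texts
    ]


# Toy classifier output (invented): u1 is almost always labeled eng.
labels_by_user = {
    "u1": ["eng"] * 39 + ["fra"],
    "u2": ["eng", "fra", "eng", "fra", "eng"],
    "u3": ["fra", "fra"],
}
monolingual = find_monolingual_users(labels_by_user)
corpus = relabel_tweets({"u1": ["tweet a", "tweet b"], "u2": ["tweet c"]}, monolingual)
```

Here only `u1` qualifies, so the one tweet the classifier labeled `fra` would still enter the corpus as `eng`; that is how text the classifier handles poorly can be recovered.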

Page 42:

Build strategically diverse corpora and synthesize code-switched examples

Topical

Social

Geographic

Multilingual
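The slide does not say how code-switched examples are synthesized. One simple possibility, assumed here purely for illustration, is to join monolingual sentences from two different languages and keep a language label for each word. The function name and the toy sentences are hypothetical.

```python
import random


def synthesize_code_switched(sentences_by_lang, rng):
    """Join one sentence from each of two different languages.

    Returns the combined text and one language label per whitespace token.
    """
    lang_a, lang_b = rng.sample(sorted(sentences_by_lang), 2)
    text_a = rng.choice(sentences_by_lang[lang_a])
    text_b = rng.choice(sentences_by_lang[lang_b])
    labels = [lang_a] * len(text_a.split()) + [lang_b] * len(text_b.split())
    return f"{text_a} {text_b}", labels


# Toy monolingual pools (the sentences reuse the example shown in the talk).
sentences_by_lang = {
    "fra": ["Je vais commander à emporter."],
    "eng": ["I’m too lazy to cook."],
}
text, labels = synthesize_code_switched(sentences_by_lang, random.Random(0))
for word, label in zip(text.split(), labels):
    print(label, word)
```

Passing a seeded `random.Random` keeps the synthetic corpus reproducible.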

Page 48:

Our model solution: treat language identification as a character-based sequence-to-sequence task.

Encoder: encodes the whole sentence using its characters

Input: Je vais commander à emporter. I’m too lazy to cook. (read character by character: J e _ … o k .)

Decoder: decodes each word’s language from the sentence encoding

Output: Fra Fra Fra Fra Fra . Eng Eng Eng Eng Eng .

[Legend: each box represents a multi-layer recurrent neural network]

Jaech et al. 2016; Samih et al. 2016
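To make the input and output of this formulation concrete, the sketch below builds the character sequence the encoder would read and the per-word label sequence the decoder would be trained to emit. It covers data formatting only and contains no neural network. The tokenizer regex and the choice to copy punctuation through unchanged are assumptions based on the output shown on the slide.

```python
import re

# Words (optionally with an internal apostrophe) or single punctuation marks.
TOKEN = re.compile(r"\w+(?:['’]\w+)*|[^\w\s]")


def encoder_input(text):
    """Split text into characters, showing spaces as '_' as on the slide."""
    return ["_" if ch == " " else ch for ch in text]


def decoder_target(segments):
    """Build one output symbol per token from (text, language) segments.

    Words map to their segment's language label; punctuation is copied through.
    """
    target = []
    for text, language in segments:
        for token in TOKEN.findall(text):
            target.append(language if token[0].isalnum() else token)
    return target


segments = [
    ("Je vais commander à emporter.", "Fra"),
    ("I’m too lazy to cook.", "Eng"),
]
source = encoder_input(" ".join(text for text, _ in segments))
target = decoder_target(segments)
print(" ".join(source[:3]), "…", " ".join(source[-3:]))
print(" ".join(target))
```

The source is much longer than the target (one symbol per character versus one per word), which is why the task is framed as sequence-to-sequence rather than as tagging each input position.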

Page 52:

Equilid vs off-the-shelf

[Figure: three bar charts of Macro F1 (axis 0 to 100). 70 Languages on Twitter: langid.py, CLD2, Our Method. Geo-diverse Tweets: langid.py, CLD2, Our Method. Multilingual Tweets: Polyglot, CLD2, Our Method.]

Lui et al. 2013, 2014
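The charts report Macro F1, which averages per-language F1 so that every language counts equally regardless of how many tweets it has. A minimal sketch follows, assuming the average is taken over the languages that occur in the gold labels; the toy labels are invented.

```python
def macro_f1(gold, predicted):
    """Average per-language F1, giving every gold language equal weight."""
    languages = sorted(set(gold))
    scores = []
    for lang in languages:
        tp = sum(g == lang and p == lang for g, p in zip(gold, predicted))
        fp = sum(g != lang and p == lang for g, p in zip(gold, predicted))
        fn = sum(g == lang and p != lang for g, p in zip(gold, predicted))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the language never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data: one of the two "hau" tweets is mistaken for "eng".
gold = ["eng"] * 8 + ["hau"] * 2
predicted = ["eng"] * 8 + ["eng", "hau"]

accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
score = macro_f1(gold, predicted)
print(f"accuracy={accuracy:.2f} macro_f1={score:.2f}")
```

In this toy case accuracy is 0.90 while Macro F1 is about 0.80, because the single error falls on the smaller language and that language carries half of the average.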

Page 53:

Equilid even outperforms systems specifically tuned for each dataset

[Figure: two bar charts of Macro F1 (axis 0 to 100). 70 Languages on Twitter: bar values 92 and 91.2; labels include Our Method, langid.py, and CLD2. TweetLID: bar values 79.6 and 78.7; labels are Jaech et al. (2016) and Our Method.]

Page 56:

Case Study: Do our solutions provide socially-equitable language identification for health-related queries?

1M Tweets with any of 385 English terms from established lexicons for influenza, psychological well-being, and social health

Lamb et al. (2013); Smith et al. (2016); Preotiuc-Pietro et al. (2015); Park et al. (2016)

Task: does the language identification system recognize every tweet as English?
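The evaluation task can be sketched as a per-country recognition rate. The sketch assumes every tweet in the sample is English, so the share labeled English serves as an accuracy estimate; the function name, the `"en"` label, and the toy predictions are hypothetical.

```python
from collections import defaultdict


def english_rate_by_country(predictions):
    """Share of tweets labeled English, per origin country.

    predictions is an iterable of (country_code, predicted_language) pairs
    for tweets that are all assumed to be English.
    """
    total = defaultdict(int)
    english = defaultdict(int)
    for country, language in predictions:
        total[country] += 1
        english[country] += language == "en"
    return {country: english[country] / total[country] for country in total}


# Invented predictions for illustration only.
predictions = [("US", "en"), ("US", "en"), ("NG", "en"), ("NG", "und")]
rates = english_rate_by_country(predictions)
print(rates)
```

Plotting these rates against each country's Human Development Index would give a chart of the kind shown earlier in the talk.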

Page 57:

Equilid raises the bar for socially-equitable language identification

[Figure: estimated accuracy for English tweets (y-axis) against the Human Development Index of the text’s origin country (x-axis)]

Page 59:

Social Equality doesn’t stop at Language Identification

Methodologies capable of handling language as it is used

Better social representation in our data

Page 60:

David Jurgens, Yulia Tsvetkov, and Dan Jurafsky

Be equitable! https://github.com/davidjurgens/equilid