Transcript of: Hinrich Schütze 2015-11-04 - uni-muenchen.de › ~hs › teach › 15w › pmclii › pdf ›...

  • LING reps Count vectors Deep learning POS tagging MORPH tagging Sentiment Semantics

    Representation Learning for Domain Adaptation

    Hinrich Schütze

    Center for Information and Language Processing, University of Munich

    2015-11-04

    Schütze: Representation learning for domain adaptation 1 / 97

    Overview

    1 “Traditional” computational linguistics representations

    2 Count vector representations

    3 Deep learning representations

    4 Task 1: Part-of-speech (POS) tagging

    5 Task 2: Morphological (MORPH) tagging

    6 Task 3: Sentiment analysis

    7 Task 4: Semantic similarity between words

    8 Conclusion


    Representations created by (computational) linguists

    Generative lexicon (Pustejovsky) entry for “build”:


    Representations created by (computational) linguists

    Lexicon entry for “obeshchat’” in Tolkovo-kombinatornyj Slovar’ Sovremennogo Russkogo Jazyka:


    Representations created by (computational) linguists

    Morphological paradigm of French verb “faire”:



    “Traditional” representations in computational linguistics

    Motivated by linguistic theory

    Many successes in practical applications

    So why would we need any other representation in computational linguistics?


    Why learned representations: Problems with LING reps

    Coverage

    Domain dependence

    Noise / need for robustness

    Manual creation of representations for rich semantics / world knowledge: unsolved problem



    Problems for traditional CL: Coverage

    Natural languages are productive: new words and meanings are created all the time.

    Example: “unfriend”

    “A new study from a University of Colorado Denver grad student attempts to uncover what types of people we are most likely to unfriend.”

    Traditional CL: New words are not covered.

    Representation learning: Representations for new words can be automatically learned.



    Problems for traditional CL: Domain dependence

    The language in many NLP applications has domain-specific properties.

    Example: Patents

    An apparatus for winding fence material comprising a leading edge portion, wherein said apparatus is comprised of a first shaft . . .

    The word “said” is used as a demonstrative here.

    Traditional CL: Domain-dependent usage not covered.

    Representation learning: Representations for domain-dependent usage can be automatically learned.


    Problems for traditional CL: Noise / need for robustness

    Tweet: “water” = “what are”

    Amazon review: “since i was young i always dreamed of going to walt disney world, but no that i live in florida i go there every chance i get!but the days i cant go i just play this game, its like being on the rides themselves!not to easy for you to beat in a day”


    Problems for traditional CL: Rich semantics / world knowledge

    [Concept map centered on “spider”:]

    Taxonomy: spiders are animals, similar to insects

    Attributes of spiders: venomous, fuzzy, small, fast-moving

    Typical actions of spiders: bite, prey, weave, burrow

    Meaning is very heterogeneous: abstract, concrete, sensory, core semantics, world knowledge

    Traditional CL: Difficult to represent all this in a computationally useful way

    Representation learning: deal well with heterogeneity of meaning


    Problems for traditional CL: Zipf

    The long tail of language use

    The adjective “hard”: die hard, hard by, hard and fast, hard copy, hard back, hard core, hard disk, hard drive, hard drugs, hard earned, hard hit, hard rock, hard going, hard nosed, hard of hearing, hard put, hard to get, hard way

    Not just memorization: “hard and fast” → “fast and hard”, “hard back” → “hard bound”, “hard rock” → “hard punk”, “truly knuckles-scraping-against-asphalt hard”


    Why learned representations: Problems with LING reps

    Coverage

    Domain dependence

    Noise / need for robustness

    Manual creation of representations for rich semantics / world knowledge: unsolved problem



    Types of representations used in NLP

    NONE: No representation, except for word index

    LING: Representations based on linguistic resources

    COUNT: Count vectors

    UNSU: Representations learned by unsupervised learning

    PREDICT: Representations learned by supervised learning: embeddings / predict vectors

    Next: COUNT vector models



    Count vector models

    Dimensionality is vocabulary V (or large subset thereof)

    Value of dimension i of the distributional representation of word v: (weighted) cooccurrence count of v and w_i

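As a concrete sketch of this definition (the vocabulary is restricted to three illustrative context words, and the cooccurrence table is assumed precomputed, with numbers taken from the Wikipedia example on the following slides):

```python
# Count vector: dimension i of the representation of word v is the
# cooccurrence count of v with the i-th context word w_i.
context_words = ["silver", "disease", "society"]  # stand-in for the full vocabulary V

# assumed precomputed cooccurrence table (counts from the slides)
cooc = {
    ("rich", "silver"): 186, ("poor", "silver"): 34,
    ("rich", "disease"): 17, ("poor", "disease"): 162,
    ("rich", "society"): 143, ("poor", "society"): 228,
}

def count_vector(v, context_words, cooc):
    """Count vector of word v: one dimension per context word."""
    return [cooc.get((v, w), 0) for w in context_words]

print(count_vector("rich", context_words, cooc))  # [186, 17, 143]
```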


    Count vector model: The counts

    Count the cooccurrence of two words in a large corpus

    E.g., cooccurrence = cooccurrence within k = 10 words

    Example counts from Wikipedia

    cooc.(rich,silver) = 186
    cooc.(poor,silver) = 34
    cooc.(rich,disease) = 17
    cooc.(poor,disease) = 162
    cooc.(rich,society) = 143
    cooc.(poor,society) = 228


    Count vector model: Vectors

    [Plot: the count vectors of “silver”, “disease”, and “society” in the two-dimensional space with dimensions “poor” and “rich”]

    cooc.(poor,silver) = 34, cooc.(rich,silver) = 186
    cooc.(poor,disease) = 162, cooc.(rich,disease) = 17
    cooc.(poor,society) = 228, cooc.(rich,society) = 143


    Count vector model: Similarity

    [Plot: the vectors of “gold”, “silver”, “disease”, and “society” in the “poor”/“rich” plane]

    The similarity between two words is the cosine of the angle between them.

    Small angle: gold and silver are similar.

    Large angle: gold and disease are not similar.

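With the counts from the previous slides, cosine similarity can be computed directly (a sketch; the slides do not give coordinates for “gold”, so the comparison below uses the three words whose counts are listed):

```python
import math

def cosine(u, v):
    """Cosine of the angle between two count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# (poor, rich) cooccurrence counts from the slides
silver = (34, 186)
disease = (162, 17)
society = (228, 143)

print(round(cosine(silver, society), 2))  # 0.68: smaller angle
print(round(cosine(silver, disease), 2))  # 0.28: large angle, not similar
```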

    Types of representations used in NLP

    NONE: No representation, except for word index

    LING: Representations based on linguistic resources

    COUNT: Count vectors

    UNSU: Representations learned by unsupervised learning

    PREDICT: Representations learned by supervised learning: embeddings / predict vectors

    Next: PREDICT: predict vectors in deep learning



    Terminology

    A distributed representation

    is simply a vector representation, i.e., a point in a high-dimensionalreal-valued space. Implicit in the concept of distributedrepresentation is that similarity/distance is interpretable. E.g.,representing a 1000x1000 binary pixel image as a one milliondimensional vector is not distributed since similarity/distance doesnot have an intuitive interpretation.

    Embeddings/predict vectors, count vectors, representations learned by unsupervised learning: these are all distributed representations. (Linguistic resources usually do not provide distributed representations.)

    Schütze: Representation learning for domain adaptation 23 / 97
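    The claim that similarity/distance is interpretable can be made concrete with cosine similarity. A minimal sketch with invented toy vectors (illustrative values, not real embeddings):

    ```python
    import numpy as np

    def cosine(u, v):
        # Cosine similarity: the usual interpretable measure for distributed representations
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    # Toy 3-dimensional distributed representations (invented for illustration)
    gold = np.array([8.0, 1.0, 0.5])
    silver = np.array([7.5, 1.2, 0.4])
    disease = np.array([0.3, 6.0, 5.5])

    # Nearby points = similar words, distant points = dissimilar words
    assert cosine(gold, silver) > cosine(gold, disease)
    ```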


  • LING reps Count vectors Deep learning POS tagging MORPH tagging Sentiment Semantics

    Terminology (cont.)

    A distributional representation

    can be defined (i) as a representation based on distributional information, i.e., on the distribution of words in contexts in a large corpus, or (ii) as a synonym of distributed representation.

    Count vectors are distributional representations according to definition (i). Embeddings/predict vectors and representations learned by unsupervised learning may or may not be viewed as distributional representations according to definition (ii), since the link to the distribution of words in contexts is more indirect in this case.

    Schütze: Representation learning for domain adaptation 24 / 97

  • LING reps Count vectors Deep learning POS tagging MORPH tagging Sentiment Semantics

    Outline

    1 “Traditional” computational linguistics representations

    2 Count vector representations

    3 Deep learning representations

    4 Task 1: Part-of-speech (POS) tagging

    5 Task 2: Morphological (MORPH) tagging

    6 Task 3: Sentiment analysis

    7 Task 4: Semantic similarity between words

    8 Conclusion

    Schütze: Representation learning for domain adaptation 25 / 97


  • LING reps Count vectors Deep learning POS tagging MORPH tagging Sentiment Semantics

    Deep learning as a gestalt

    Automatic learning of features (as opposed to hand-designed features)

    nonlinear

    Representation learning: embeddings or predict vectors

    “deep” = “multi-layer architectures”

    Schütze: Representation learning for domain adaptation 26 / 97

  • LING reps Count vectors Deep learning POS tagging MORPH tagging Sentiment Semantics

    Geoff Hinton on Automatic Feature Learning

    Adding a layer of hand-coded features . . . makes them much more powerful but the hard bit is designing the features. We need to automate the loop of designing features for a particular task and seeing how well they work.

    Schütze: Representation learning for domain adaptation 27 / 97

  • LING reps Count vectors Deep learning POS tagging MORPH tagging Sentiment Semantics

    Opposing view: Feature design still is important

    Solving any complex task requires domain expertise.

    Domain expertise can be used in various ways: definition of task to be learned, collection and composition of training data, architecture of machine learning system, design of representation, design of features

    It’s unclear why domain expertise should be used for some of these, but not for feature design.

    Schütze: Representation learning for domain adaptation 28 / 97


  • LING reps Count vectors Deep learning POS tagging MORPH tagging Sentiment Semantics

    Deep learning: Nonlinearity

    SVMs (perhaps the main competitor of neural networks) are more efficient and have a better understood theory than neural networks.

    But they are linear.

    The only “knob” you can turn is the kernel that gives the learner access to similarity in a complex representation space.

    Guess: Real life is complicated and often nonlinear.

    Neural networks offer more flexibility in learning complex decision boundaries.

    Schütze: Representation learning for domain adaptation 29 / 97


  • LING reps Count vectors Deep learning POS tagging MORPH tagging Sentiment Semantics

    Deep learning: Representation learning + Architectures

    Representation learning

    Supervised learning to train embeddings or predict-vectors for words

    Architectures: deep = multilayer

    Use trained predict-vectors/embeddings in a deep, multilayer neural network architecture

    Schütze: Representation learning for domain adaptation 30 / 97

  • LING reps Count vectors Deep learning POS tagging MORPH tagging Sentiment Semantics

    Embeddings = predict vectors (Schwenk & Koehn 2008)

    Language modeling task: predict the next word wj from the n − 1 preceding words wj−n+1, wj−n+2, . . . , wj−1

    Input representation of words: one-hot vectors

    Output/target: multinomial classification, V classes, where V is the size of the vocabulary

    Embedding layer for learning predict vectors. There is only one embedding per word, independent of position.

    Complex nonlinear decision surfaces can be learned due to the hidden layer.

    Embeddings/predict vectors are learned by backpropagating the prediction error.

    Schütze: Representation learning for domain adaptation 31 / 97
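    The pipeline described on this slide (one-hot input, shared embedding layer, hidden layer, softmax over V classes, prediction error backpropagated into the embeddings) can be sketched in a few lines of numpy. Sizes and data are toy values; this is an illustrative reconstruction, not the authors' implementation:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    V, d, h, n = 10, 4, 8, 3                  # vocab size, embedding dim, hidden dim, n-gram order
    E = rng.normal(0, 0.1, (V, d))            # embedding table: one row per word, shared across positions
    W1 = rng.normal(0, 0.1, ((n - 1) * d, h))
    W2 = rng.normal(0, 0.1, (h, V))

    def forward(context):
        """context: n-1 word ids; returns (input vector, hidden layer, softmax over V words)."""
        x = E[context].reshape(-1)            # looking up row i of E == multiplying a one-hot vector by E
        hid = np.tanh(x @ W1)                 # hidden layer: enables nonlinear decision surfaces
        logits = hid @ W2
        p = np.exp(logits - logits.max())     # multinomial classification over V classes
        return x, hid, p / p.sum()

    def loss(context, target):
        return -np.log(forward(context)[2][target])

    def train_step(context, target, lr=0.5):
        """Backpropagate the prediction error, including into the shared embedding rows."""
        global W1, W2
        x, hid, p = forward(context)
        dlogits = p.copy()
        dlogits[target] -= 1.0                # gradient of softmax + cross-entropy
        dtanh = (W2 @ dlogits) * (1 - hid ** 2)
        dx = W1 @ dtanh                       # gradient w.r.t. the concatenated embeddings
        W2 = W2 - lr * np.outer(hid, dlogits)
        W1 = W1 - lr * np.outer(x, dtanh)
        for i, w in enumerate(context):       # the embeddings are learned from the prediction error
            E[w] -= lr * dx[i * d:(i + 1) * d]

    before = loss([1, 2], 3)
    for _ in range(50):
        train_step([1, 2], 3)                 # toy training: repeatedly predict word 3 after words 1, 2
    assert loss([1, 2], 3) < before
    ```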


  • LING reps Count vectors Deep learning POS tagging MORPH tagging Sentiment Semantics

    Embeddings/predict vectors: Comments

    Low dimensionality, can be used efficiently for a wide range of NLP tasks

    Supervised training, can in theory learn arbitrarily complex phenomena

    Rare events as well as frequent events

    Complex contextual dependencies

    Word order is respected.

    Very cautious independence assumptions compared to count vectors (high-order Markov assumption)

    Many different approaches to learning embeddings/predict vectors

    Schütze: Representation learning for domain adaptation 32 / 97

  • [Figure: t-SNE visualization of a small subset of embeddings/predict vectors. Clusters: football clubs (FC Barcelona, Man Utd, Arsenal FC, Inter Milan FC, Schalke, AC Milan, Bayern); cities and places (Barcelona, England, London, Berlin, Washington, Los Angeles, LA, Rome, Paris, NY, WA, Chicago, Reading UK, Reading PA, village, town, city, island, park); reading/learning-related words (Reading VERB, Learning, vocabulary, poetry, composing, semantics, translating, terminology, writing)]
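    Maps like the one above are typically produced with off-the-shelf t-SNE. A hedged sketch, with random vectors standing in for real embeddings and scikit-learn assumed to be available:

    ```python
    import numpy as np
    from sklearn.manifold import TSNE

    rng = np.random.default_rng(0)
    # Stand-ins for embeddings: two loose clusters (think "clubs" vs. "cities")
    emb = np.vstack([rng.normal(0, 0.1, (5, 50)) + 1.0,
                     rng.normal(0, 0.1, (5, 50)) - 1.0])

    # Project the 50-dimensional vectors to 2D for plotting; perplexity must be < #points
    coords = TSNE(n_components=2, perplexity=3, init="random",
                  random_state=0).fit_transform(emb)
    assert coords.shape == (10, 2)
    ```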


  • Deep learning: Deep architectures

    Lookup table: this is where the learned embeddings are retrieved and fed into the network

    Example of complex learning architecture: convolution, max over time, hidden layer

    Meaning of the logical operator “or” ≠ embedding of “or”
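    The "convolution, max over time" step mentioned above can be sketched as follows (toy sizes, plain numpy, the filter application written out naively for clarity):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    T, d, f, w = 7, 4, 5, 3      # sentence length, embedding dim, number of filters, filter width

    X = rng.normal(size=(T, d))       # embeddings for one sentence, as retrieved from the lookup table
    F = rng.normal(size=(f, w * d))   # each row is one convolution filter

    # Convolution over time: every filter sees every window of w consecutive word embeddings
    windows = np.stack([X[t:t + w].reshape(-1) for t in range(T - w + 1)])  # (T-w+1, w*d)
    conv = windows @ F.T                                                    # (T-w+1, f)

    # Max over time: one value per filter, a fixed-size sentence vector for any sentence length
    pooled = conv.max(axis=0)
    assert pooled.shape == (f,)
    ```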

  • LING reps Count vectors Deep learning POS tagging MORPH tagging Sentiment Semantics

    Vision: Deep network, automatic features

    Schütze: Representation learning for domain adaptation 35 / 97

  • LING reps Count vectors Deep learning POS tagging MORPH tagging Sentiment Semantics

    Domain knowledge built in, end-to-end

    Schütze: Representation learning for domain adaptation 36 / 97

  • LING reps Count vectors Deep learning POS tagging MORPH tagging Sentiment Semantics

    Key to remember: these are PREDICT vectors

    Schütze: Representation learning for domain adaptation 37 / 97

  • LING reps Count vectors Deep learning POS tagging MORPH tagging Sentiment Semantics

    Key to remember: these are COUNT vectors

    [Figure: words as points in a two-dimensional count-vector space; axes = co-occurrence counts with “rich” and “poor”; points plotted for gold, silver, disease, society]

    Schütze: Representation learning for domain adaptation 38 / 97
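    Count vectors like the ones plotted above come directly from co-occurrence counting. A toy sketch with an invented mini-corpus and an arbitrary window size:

    ```python
    from collections import Counter
    import numpy as np

    corpus = ("gold and silver are rich metals . silver is a rich metal . "
              "disease keeps the poor poor . the poor suffer from disease .").split()

    dims = ["rich", "poor"]   # the two context words used as axes

    def count_vector(word, window=4):
        # Count how often each axis word occurs within +-window tokens of `word`
        counts = Counter()
        for i, w in enumerate(corpus):
            if w == word:
                counts.update(corpus[max(0, i - window): i + window + 1])
        return np.array([counts[dim] for dim in dims], dtype=float)

    # gold and silver point in the same direction; disease points toward "poor"
    assert count_vector("silver")[0] > count_vector("silver")[1]
    assert count_vector("disease")[1] > count_vector("disease")[0]
    ```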


  • LING reps Count vectors Deep learning POS tagging MORPH tagging Sentiment Semantics

    Count vectors vs Embeddings/Predict Vectors

                               COUNT           PREDICT
    dimensionality             high            low to medium
    learning regime            unsupervised    supervised
    complex linguistic context hard to model   easier to model
    rare event coverage        poor            good
    independence assumptions   strong          weak
                               simple          complex
                               efficient       long training times
                               elegant         messy

    Schütze: Representation learning for domain adaptation 39 / 97

  • LING reps Count vectors Deep learning POS tagging MORPH tagging Sentiment Semantics

    Vision: Deep network, automatic features

    For natural language: what corresponds to pixels? what corresponds to edges? what corresponds to object parts? what corresponds to object models?

    Schütze: Representation learning for domain adaptation 40 / 97

  • LING reps Count vectors Deep learning POS tagging MORPH tagging Sentiment Semantics

    Deep learning as a gestalt

    Automatic learning of features (as opposed to hand-designed features)

    nonlinear

    Representation learning: embeddings or predict vectors

    “deep” = “multi-layer architectures”

    Schütze: Representation learning for domain adaptation 41 / 97


  • LING reps Count vectors Deep learning POS tagging MORPH tagging Sentiment Semantics

    Deep learning: Why now?

    Moore’s law

    Big data: Several orders of magnitude more than in 80s / 90s

    Better understanding of how to train very complex networks: initialization, regularization, much expanded bag of tricks

    Canonical machine learning stuck? – Great strides, but not recently

    Diverse knowledge about the domain can be integrated into a neural network architecture in a very flexible way – but it still can be trained end-to-end.

    Schütze: Representation learning for domain adaptation 42 / 97


  • LING reps Count vectors Deep learning POS tagging MORPH tagging Sentiment Semantics

    Types of representations used in NLP

    NONE: No representation, except for word index (typical approach to supervised training in NLP is to have no initial representation of a word)

    LING: Representations based on linguistic resources

    COUNT: Count vectors

    UNSU: Representations learned by unsupervised learning: SVD, LSI, PLSI, NMF, Hellinger PCA, MSDA, (Brown) clustering

    PREDICT: Representations learned by supervised learning:embeddings / predict vectors

    Next: Which representation is best for NLP?

    Schütze: Representation learning for domain adaptation 43 / 97
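    For the UNSU category, the classic recipe is SVD over a count matrix. A minimal sketch with invented toy counts:

    ```python
    import numpy as np

    # Toy word-by-context count matrix (rows: gold, silver, disease; columns: context words)
    M = np.array([[8.0, 1.0, 0.5],
                  [7.5, 1.2, 0.4],
                  [0.3, 6.0, 5.5]])

    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    k = 2
    reduced = U[:, :k] * s[:k]   # k-dimensional word representations, learned without supervision

    assert reduced.shape == (3, 2)
    assert np.allclose((U * s) @ Vt, M)   # the full SVD reconstructs M exactly
    ```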


  • LING reps Count vectors Deep learning POS tagging MORPH tagging Sentiment Semantics

    Which representation is best for domain adaptation?

    Task 1: Part-of-speech (POS) tagging(very low complexity task)

    Task 2: Morphological (MORPH) tagging(low complexity task)

    Task 3: Sentiment(medium complexity task)

    Task 4: Semantic similarity(high complexity task)

    Schütze: Representation learning for domain adaptation 44 / 97

  • LING reps Count vectors Deep learning POS tagging MORPH tagging Sentiment Semantics

    Problem setting: Domain adaptation

    [Slides 45–47 illustrate the setting with figures not reproduced in the transcript.]

    Schütze: Representation learning for domain adaptation 45–47 / 97

  • LING reps Count vectors Deep learning POS tagging MORPH tagging Sentiment Semantics

    Outline

    1 “Traditional” computational linguistics representations

    2 Count vector representations

    3 Deep learning representations

    4 Task 1: Part-of-speech (POS) tagging

    5 Task 2: Morphological (MORPH) tagging

    6 Task 3: Sentiment analysis

    7 Task 4: Semantic similarity between words

    8 Conclusion

    Schnabel & Schütze: POS tagging 48 / 97

  • LING reps Count vectors Deep learning POS tagging MORPH tagging Sentiment Semantics

    This section based on: Schnabel & Schütze. FLORS: Fast and Simple Domain Adaptation for Part-of-Speech Tagging. Transactions of the Association for Computational Linguistics (TACL), 2:15–26, 2014

    Schnabel & Schütze: POS tagging 49 / 97

  • LING reps Count vectors Deep learning POS tagging MORPH tagging Sentiment Semantics

    Task: Part-of-speech (POS) tagging

    Disambiguate part-of-speech (syntactic category) in context

    Example:
      time   NN
      flies  VBZ
      like   IN
      an     DT
      arrow  NN

    “flies” can be a form of the verb “to fly” or the plural of the noun “fly”.

    It is correctly disambiguated here.

    Schnabel & Schütze: POS tagging 50 / 97

  • LING reps Count vectors Deep learning POS tagging MORPH tagging Sentiment Semantics

    Representation for POS tagging

    Formalize problem as classification of a 5-word context (using linear SVM)

    Feature representation used for 5-word context:

      suffix, shape
      COUNT
      UNSU: Brown clusters
      PREDICT: Collobert & Weston

    Question: Which representation works best for POS tagging: COUNT, UNSU or PREDICT?

    Schnabel & Schütze: POS tagging 51 / 97
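The setup above can be sketched end to end. The templates below mimic the 5-word window plus suffix and shape features; a multiclass perceptron stands in for the linear SVM (an assumption made to keep the example self-contained; this is not the FLORS implementation):

```python
from collections import defaultdict

def window_features(words, i, pad="<PAD>"):
    """Binary features for the 5-word window centred on position i."""
    padded = [pad, pad] + words + [pad, pad]
    feats = [f"w[{k}]={padded[i + 2 + k]}" for k in range(-2, 3)]
    feats.append(f"suffix={words[i][-2:]}")                          # suffix feature
    feats.append(f"shape={'X' if words[i][0].isupper() else 'x'}")   # crude shape feature
    return feats

class Perceptron:
    """Multiclass perceptron; a stand-in for the linear SVM of the slides."""
    def __init__(self, labels):
        self.labels = labels
        self.w = defaultdict(float)   # (label, feature) -> weight

    def score(self, label, feats):
        return sum(self.w[(label, f)] for f in feats)

    def predict(self, feats):
        return max(self.labels, key=lambda y: self.score(y, feats))

    def train(self, data, epochs=50):
        for _ in range(epochs):
            for feats, gold in data:
                guess = self.predict(feats)
                if guess != gold:
                    for f in feats:
                        self.w[(gold, f)] += 1.0
                        self.w[(guess, f)] -= 1.0

sentences = [
    (["time", "flies", "like", "an", "arrow"], ["NN", "VBZ", "IN", "DT", "NN"]),
    (["fruit", "flies", "like", "a", "banana"], ["NN", "NNS", "VB", "DT", "NN"]),
]
data = [(window_features(ws, i), ts[i])
        for ws, ts in sentences for i in range(len(ws))]
tagger = Perceptron(labels=sorted({t for _, ts in sentences for t in ts}))
tagger.train(data)
print(tagger.predict(window_features(["time", "flies", "like", "an", "arrow"], 1)))
```

On the two toy sentences the surrounding words are what disambiguate the two readings of “flies”, which is exactly the role the window features play.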

  • LING reps Count vectors Deep learning POS tagging MORPH tagging Sentiment Semantics

    Representation for POS tagging

    Schnabel & Schütze: POS tagging 52 / 97

  • LING reps Count vectors Deep learning POS tagging MORPH tagging Sentiment Semantics

    POS tagging: Results

                newsgroups        reviews           weblogs
                ALL     OOV       ALL     OOV       ALL     OOV
    COUNT       90.86   66.42     92.95   75.29     94.71   83.64
    UNSU        90.34∗  62.41∗    92.23∗  71.47∗    94.45   81.76
    PREDICT     90.57   64.57     92.54∗  72.48∗    94.51   80.58∗

                answers           emails
                ALL     OOV       ALL     OOV
    COUNT       90.30   62.15     89.44   62.61
    UNSU        89.71∗  56.28∗    89.02∗  63.20
    PREDICT     90.23   60.99     89.44   63.13

    Schnabel & Schütze: POS tagging 56 / 97

  • LING reps Count vectors Deep learning POS tagging MORPH tagging Sentiment Semantics

    POS tagging: Results

                newsgroups        reviews           weblogs
                ALL     OOV       ALL     OOV       ALL     OOV
    COUNT       90.86   66.42     92.95   75.29     94.71   83.64
    PREDICT     90.57   64.57     92.54∗  72.48∗    94.51   80.58∗
    UNSU        90.34∗  62.41∗    92.23∗  71.47∗    94.45   81.76

                answers           emails            INDOMAIN
                ALL     OOV       ALL     OOV       ALL     OOV
    COUNT       90.30   62.15     89.44   62.61     96.59   90.37
    PREDICT     90.23   60.99     89.44   63.13     96.72   90.48
    UNSU        89.71∗  56.28∗    89.02∗  63.20     96.48∗  87.50

    Schnabel & Schütze: POS tagging 57 / 97

  • LING reps Count vectors Deep learning POS tagging MORPH tagging Sentiment Semantics

    Best representation for POS tagging

    NONE: No representation

    LING: Representations based on linguistic resources

    COUNT: Count vectors

    UNSU: Rep’s learned by unsupervised learning

    PREDICT: Predict vectors

    And the winner is: COUNT

    Schnabel & Schütze: POS tagging 58 / 97

  • LING reps Count vectors Deep learning POS tagging MORPH tagging Sentiment Semantics

    POS tagging: Why is COUNT best?

    NONE: No representation

    LING: Representations based on linguistic resources

    COUNT: Count vectors

    UNSU: Rep’s learned by unsupervised learning

    PREDICT: Predict vectors

    COUNT is better than NONE because representation learning (doing some adaptation vs. no adaptation) works in this case.

    Why is COUNT better than UNSU and PREDICT?

    Hypothesis: POS tagging is a very simple problem, so you don’t need a complex representation learning formalism.

    Schnabel & Schütze: POS tagging 59 / 97

  • LING reps Count vectors Deep learning POS tagging MORPH tagging Sentiment Semantics

    Outline

    1 “Traditional” computational linguistics representations

    2 Count vector representations

    3 Deep learning representations

    4 Task 1: Part-of-speech (POS) tagging

    5 Task 2: Morphological (MORPH) tagging

    6 Task 3: Sentiment analysis

    7 Task 4: Semantic similarity between words

    8 Conclusion

    Müller, Schmid, Schütze (in progress): MORPH tagging 60 / 97

  • LING reps Count vectors Deep learning POS tagging MORPH tagging Sentiment Semantics

    This section based on: Müller, Schmid & Schütze. Domain Adaptation for Morphological Tagging. In progress.

    Müller, Schmid, Schütze (in progress): MORPH tagging 61 / 97

  • LING reps Count vectors Deep learning POS tagging MORPH tagging Sentiment Semantics

    Task: Morphological (MORPH) tagging

    Disambiguate both part-of-speech and morphological features

    Example:
      Ein            ART    case=nom|number=sg|gender=neut
      Klettergebiet  NN     case=nom|number=sg|gender=neut
      macht          VVFIN  number=sg|person=3|tense=pres|mood=ind
      Geschichte     NN     case=acc|number=sg|gender=fem

    Part-of-speech disambiguation: ART, NN, VVFIN

    Morphological disambiguation: case=nom, number=sg, tense=pres, mood=ind, etc.

    Müller, Schmid, Schütze (in progress): MORPH tagging 62 / 97

  • LING reps Count vectors Deep learning POS tagging MORPH tagging Sentiment Semantics

    Representation for MORPH tagging

    Formalize problem as sequence classification (using higher-order CRF: MarMoT)

    Feature representation used for each token:

      NONE (word index), suffix, shape
      UNSU: SVD, Brown clusters
      PREDICT: polyglot (Al-Rfou et al.)
      LING: finite-state morphology (manually created linguistic resource)

    Question: Which representation works best for MORPH tagging: NONE, LING, UNSU or PREDICT?

    Müller, Schmid, Schütze (in progress): MORPH tagging 63 / 97
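The baseline (NONE-style) per-token features listed above (word index, suffix, shape) can be sketched as follows; the suffix lengths and the shape alphabet are assumptions for illustration, not MarMoT's actual feature templates:

```python
def token_features(word, vocab):
    """Baseline (NONE-style) features: word index, suffixes, shape."""
    # Word identity as an index into a known vocabulary (-1 for OOV words).
    feats = {"index": vocab.get(word, -1)}
    # Suffixes up to length 3 often capture case/number/gender endings.
    for n in (1, 2, 3):
        feats[f"suffix{n}"] = word[-n:]
    # Shape: collapse characters into X (upper), x (lower), d (digit).
    feats["shape"] = "".join(
        "X" if c.isupper() else "d" if c.isdigit() else "x" for c in word
    )
    return feats

vocab = {"Klettergebiet": 0, "macht": 1}
print(token_features("Klettergebiet", vocab))
```

For Czech and Hungarian, such suffix features are exactly where much of the morphological signal lives, which is why even the NONE baseline is nontrivial.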

  • LING reps Count vectors Deep learning POS tagging MORPH tagging Sentiment Semantics

    MORPH tagging: In domain results

         SVMTool  Morfette  MarMoT
         NONE     NONE      NONE   UNSU1  UNSU2  PREDICT  LING
    cs   91.06    91.48     93.86  94.15  94.16  94.13    94.52
    hu   94.72    95.47     96.14  96.45  96.47  96.46    96.84

    Müller, Schmid, Schütze (in progress): MORPH tagging 68 / 97

  • LING reps Count vectors Deep learning POS tagging MORPH tagging Sentiment Semantics

    MORPH tagging: Results

                    SVMTool  Morfette  MarMoT
                    NONE     NONE      NONE   UNSU1  UNSU2  PREDICT  LING
    Czech (cs)      75.28    76.04     78.01  78.44  78.51  78.42    78.88
    Hungarian (hu)  88.44    89.18     89.77  90.52  90.41  90.88    91.24

    Müller, Schmid, Schütze (in progress): MORPH tagging 69 / 97

  • LING reps Count vectors Deep learning POS tagging MORPH tagging Sentiment Semantics

    Best representation for morphology DA

    NONE: No representation

    LING: Representations based on linguistic resources

    COUNT: Count vectors

    UNSU: Rep’s learned by unsupervised learning

    PREDICT: Predict vectors

    And the winner is: LING

    Müller, Schmid, Schütze (in progress): MORPH tagging 70 / 97

  • LING reps Count vectors Deep learning POS tagging MORPH tagging Sentiment Semantics

    MORPH tagging: Why is LING best?

    NONE: No representation

    LING: Representations based on linguistic resources

    COUNT: Count vectors

    UNSU: Rep’s learned by unsupervised learning

    PREDICT: Predict vectors

    Hypothesis: Learning morphological paradigms is actually a pretty hard problem. So the representation learning algorithms failed?

    Müller, Schmid, Schütze (in progress): MORPH tagging 71 / 97

  • LING reps Count vectors Deep learning POS tagging MORPH tagging Sentiment Semantics

    Discussion

    Morphology is more Zipfian.

    This is a difference between English (morphologically poor) and Czech / Hungarian (morphologically rich).

    Something like gender is difficult to infer from count vectors.

    Müller, Schmid, Schütze (in progress): MORPH tagging 72 / 97

  • LING reps Count vectors Deep learning POS tagging MORPH tagging Sentiment Semantics

    Outline

    1 “Traditional” computational linguistics representations

    2 Count vector representations

    3 Deep learning representations

    4 Task 1: Part-of-speech (POS) tagging

    5 Task 2: Morphological (MORPH) tagging

    6 Task 3: Sentiment analysis

    7 Task 4: Semantic similarity between words

    8 Conclusion

    Chen et al.: Sentiment 73 / 97

  • LING reps Count vectors Deep learning POS tagging MORPH tagging Sentiment Semantics

    This section based on: Chen, Xu, Weinberger, Sha. Marginalized denoising autoencoders for domain adaptation. ICML 2012

    Chen et al.: Sentiment 74 / 97
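The mSDA of Chen et al. learns a denoising reconstruction in closed form by marginalizing over infinitely many random feature corruptions. The sketch below follows the spirit of the single-layer recipe (no bias dimension, toy random data); it is an illustration, not the reference implementation:

```python
import numpy as np

def mda_layer(X, p):
    """One marginalized denoising autoencoder layer (no bias term).

    X: d x n data matrix; p: per-feature corruption probability.
    Solves for the reconstruction W minimizing the expected squared
    loss over all corruptions, in closed form: W = P Q^{-1}.
    """
    d = X.shape[0]
    q = np.full(d, 1.0 - p)                 # survival probability per feature
    S = X @ X.T                             # scatter matrix
    Q = S * np.outer(q, q)                  # E[x_corrupt x_corrupt^T], off-diagonal
    np.fill_diagonal(Q, q * np.diag(S))     # a feature always co-survives with itself
    P = S * q[np.newaxis, :]                # E[x x_corrupt^T]
    W = np.linalg.solve(Q.T, P.T).T         # W = P Q^{-1}
    return np.tanh(W @ X)                   # nonlinearity applied after the solve

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 20))
H = mda_layer(X, p=0.5)
print(H.shape)
```

A useful sanity check: with corruption probability p = 0 the optimal reconstruction is the identity, so the layer reduces to tanh(X). Stacking such layers (feeding each output into the next) gives the deep variant used in the sentiment experiments.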

  • LING reps Count vectors Deep learning POS tagging MORPH tagging Sentiment Semantics

    Task: Sentiment analysis

    For a review (of a book, a camera, a washing machine etc.): determine if the review has positive polarity or negative polarity.

    Chen et al.: Sentiment 75 / 97

  • LING reps Count vectors Deep learning POS tagging MORPH tagging Sentiment Semantics

    Example of a review

    I photograph almost 45 years and now is photography as my job and I am as a member of The Royal Photographic Society in England. I had bought the photographic books as new one and in the secondhand bookstore. I have at this time maybe 2 meters long a queue of these photographic books in english, german and czech language. I know, what is important information for photographer and what is the value the information in the proper time. . . . I summarize the impression from this book: I can very hard recommend this book not only for beginner but so for advanced photographer with very strong interest ybout close-up photography.

    categories: positive / neutral / negative

    classification decision: positive

    Chen et al.: Sentiment 76 / 97
