Transcript of: Learning Character Level Representation for POS Tagging
Source: home.iitk.ac.in/~manirban/cs671/hw3/Learning...

Learning Character Level Representation for POS Tagging

Cícero Nogueira dos Santos, Bianca Zadrozny

Presented by Anirban Majumder

Introduction : Distributed Word Embedding

● Useful technique to capture syntactic and semantic information about words.

● But for many NLP tasks, such as POS tagging, information about word morphology and shape is important, and it is not captured by these embeddings.

● The paper proposes a deep neural network that learns a character-level representation to capture intra-word information.

Char-WNN Architecture

● Joins word-level and character-level embeddings to perform POS tagging

● An extension of Collobert et al.'s (2011) NN architecture

● Uses a convolutional layer to extract a character-level embedding for a word of any length

Char-WNN Architecture

● Input: a fixed-sized window of words centered on the target word

● Output: for each word in the sentence, a score for each tag τ ∈ T (the tag set)

Word and Char-Level Embedding

● Every word w is drawn from a fixed-sized word vocabulary V^wrd, and every character from a fixed-sized character vocabulary V^chr

● Two embedding matrices are used:

W^wrd ∈ R^(d^wrd × |V^wrd|)

W^chr ∈ R^(d^chr × |V^chr|)

Word and Char-Level Embedding

● Given a sentence of n words {w1, w2, ..., wn}, each word wn is converted into a vector representation un as follows:

un = [r^wrd ; r^wch]

where r^wrd ∈ R^(d^wrd) is the word-level embedding and r^wch ∈ R^(cl_u) is the character-level embedding
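As an illustration, the lookup-and-concatenate step can be sketched in a few lines of numpy (the dimensions and two-word vocabulary are hypothetical toy values, and the character-level part is stubbed with random values, since it is produced by the convolutional layer described on the next slides):

```python
import numpy as np

rng = np.random.default_rng(0)

d_wrd, cl_u = 5, 3            # toy embedding sizes (hypothetical)
V_wrd = {"the": 0, "cat": 1}  # hypothetical word vocabulary

# word embedding matrix W^wrd ∈ R^(d_wrd × |V_wrd|)
W_wrd = rng.standard_normal((d_wrd, len(V_wrd)))

def word_embedding(w):
    # r^wrd: column lookup in W^wrd
    return W_wrd[:, V_wrd[w]]

def char_embedding(w):
    # stand-in for r^wch, the convolutional character-level embedding
    return rng.standard_normal(cl_u)

def u(w):
    # u_n = [r^wrd ; r^wch]: concatenation of both representations
    return np.concatenate([word_embedding(w), char_embedding(w)])

print(u("cat").shape)  # (d_wrd + cl_u,) = (8,)
```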

Word and Char-Level Embedding

(Slides 7–8: architecture diagrams; the figures are not preserved in this transcript)

Char-Level Embedding : Details

● Produces local features around each character of the word

● Combines them into a fixed-sized character-level embedding

● Given a word w composed of M characters {c1, c2, ..., cM}, each cm is transformed into a character embedding r^chr_m. The input to the convolutional layer is the sequence of character embeddings of the M characters.

Char-Level Embedding : Details

● A window of size k^chr (the character context window) slides over the sequence {r^chr_1, r^chr_2, ..., r^chr_M}

● The vector zm (concatenation of the character embeddings in the window centered at character m) is defined as follows:

z_m = ( r^chr_(m−(k^chr−1)/2) , ..., r^chr_(m+(k^chr−1)/2) )^T

Char-Level Embedding : Details

● The convolutional layer computes the j-th element of the character-level embedding r^wch of the word w as follows:

[r^wch]_j = max_(1 ≤ m ≤ M) [W^0 z_m + b^0]_j

● The matrix W^0 is used to extract local features around each character window of the given word

● A global fixed-sized feature vector is obtained by applying the max operator over the character windows
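A minimal numpy sketch of the window, convolution, and max steps (toy sizes; zero-padding at the word boundaries is an implementation assumption not spelled out on the slide):

```python
import numpy as np

rng = np.random.default_rng(1)

d_chr, cl_u, k_chr = 4, 6, 3   # toy hyper-parameter values
word = "cats"                  # word w with M = 4 characters
vocab = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}

W_chr = rng.standard_normal((d_chr, len(vocab)))  # char embedding matrix W^chr
W0 = rng.standard_normal((cl_u, k_chr * d_chr))   # convolution weights W^0
b0 = rng.standard_normal(cl_u)                    # convolution bias b^0

# character embeddings r^chr_1 ... r^chr_M, zero-padded for boundary windows
r = [W_chr[:, vocab[c]] for c in word]
pad = [np.zeros(d_chr)] * ((k_chr - 1) // 2)
r = pad + r + pad

# z_m: concatenation of the k_chr embeddings centred on character m
Z = [np.concatenate(r[m:m + k_chr]) for m in range(len(word))]

# [r^wch]_j = max_{1<=m<=M} [W^0 z_m + b^0]_j
r_wch = np.max(np.stack([W0 @ z + b0 for z in Z]), axis=0)

print(r_wch.shape)  # (cl_u,): fixed size regardless of word length
```

The max over positions is what makes the output size independent of M, so arbitrarily long words map to a vector of the same dimension.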

Char-Level Embedding : Details

● Parameters to be learned: W^chr, W^0 and b^0

● Hyper-parameters:

d^chr : the size of the character embedding vectors
cl_u : the number of convolutional units (also the size of the character-level embedding)
k^chr : the size of the character context window

Scoring

● Follows Collobert et al.'s (2011) window approach to score all tags in T for each word in a sentence

● Based on the assumption that the tag of a word depends mainly on its neighboring words

● To compute the tag scores for the n-th word in the sentence, a vector xn is first created by concatenating a sequence of k^wrd embeddings, centered on the n-th word

Scoring

● The vector xn:

x_n = ( u_(n−(k^wrd−1)/2) , ..., u_(n+(k^wrd−1)/2) )^T

● The vector xn is processed by two NN layers to compute the scores:

s(x_n) = W^2 h(W^1 x_n + b^1) + b^2

where W^1 ∈ R^(hl_u × k^wrd(d^wrd + cl_u)) and W^2 ∈ R^(|T| × hl_u)
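The two-layer scorer can be sketched as follows (toy sizes are assumptions; hardtanh is used here for the transfer function h, following Collobert et al.'s (2011) setup):

```python
import numpy as np

rng = np.random.default_rng(2)

k_wrd, d_wrd, cl_u = 3, 5, 3   # toy window and embedding sizes
hl_u, n_tags = 10, 4           # hidden-layer units and |T|
in_dim = k_wrd * (d_wrd + cl_u)

W1 = rng.standard_normal((hl_u, in_dim))   # W^1 ∈ R^(hl_u × k_wrd(d_wrd + cl_u))
b1 = rng.standard_normal(hl_u)
W2 = rng.standard_normal((n_tags, hl_u))   # W^2 ∈ R^(|T| × hl_u)
b2 = rng.standard_normal(n_tags)

def h(x):
    # hardtanh transfer function: clips activations to [-1, 1]
    return np.clip(x, -1.0, 1.0)

def score(x_n):
    # s(x_n) = W^2 h(W^1 x_n + b^1) + b^2
    return W2 @ h(W1 @ x_n + b1) + b2

x_n = rng.standard_normal(in_dim)  # concatenation of k_wrd embeddings u_n
print(score(x_n).shape)  # (|T|,): one score per tag
```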

Structured Inference :

● The tags of neighbouring words are strongly dependent

● A prediction scheme that takes the sentence structure into account (Collobert et al., 2011)

Structured Inference :

● The score for a tag path [t]_1^N = {t1, t2, ..., tN} is computed as:

S([w]_1^N, [t]_1^N, θ) = Σ_(n=1)^N ( A_(t_(n−1), t_n) + s(x_n)_(t_n) )

where s(x_n)_(t_n) is the score of tag tn for the word wn,
A_(t_(n−1), t_n) is the transition score for jumping from tag t_(n−1) to tag t_n, and
θ is the set of all trainable network parameters (W^wrd, W^chr, W^0, b^0, W^1, b^1, W^2, b^2, A)
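Given the per-word scores s(x_n) and the transition matrix A, the path score is a running sum. A sketch with toy random values (using an extra "start" row of A for the initial transition is an implementation assumption):

```python
import numpy as np

rng = np.random.default_rng(3)

n_tags, N = 3, 4
A = rng.standard_normal((n_tags + 1, n_tags))  # last row: start-transition scores
s = rng.standard_normal((N, n_tags))           # s(x_n)_t for word n, tag t

def path_score(tags):
    # S([w]_1^N, [t]_1^N, θ) = Σ_n ( A_{t_{n-1}, t_n} + s(x_n)_{t_n} )
    total, prev = 0.0, n_tags                  # prev = special start state
    for n, t in enumerate(tags):
        total += A[prev, t] + s[n, t]
        prev = t
    return total

print(round(path_score([0, 2, 1, 1]), 4))  # score of one particular tag path
```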

Network Training :

● The network is trained by minimizing the negative log-likelihood over the training set D, as in Collobert et al. (2011)

● The sentence score is interpreted as a conditional probability over a path:

log p([t]_1^N | [w]_1^N, θ) = S([w]_1^N, [t]_1^N, θ) − log( Σ_(∀[u]_1^N ∈ T^N) e^(S([w]_1^N, [u]_1^N, θ)) )

● Stochastic gradient descent is used to minimize the negative log-likelihood with respect to θ

Experimental Setup : POS Tagging Datasets

English Datasets (WSJ Corpus)

SET      | SENT.  | TOKENS  | OOSV  | OOUV
TRAINING | 38,219 | 912,344 | 0     | 6,317
DEVELOP. | 5,527  | 131,768 | 4,467 | 958
TEST     | 5,462  | 129,654 | 3,649 | 923

Portuguese Datasets (Mac-Morpho Corpus)

SET      | SENT.  | TOKENS  | OOSV  | OOUV
TRAINING | 42,021 | 959,413 | 0     | 4,155
DEVELOP. | 2,212  | 48,258  | 1,360 | 202
TEST     | 9,141  | 213,794 | 9,523 | 1,004

English POS Tagging Results:

SYSTEM  | FEATURES  | ACC.  | ACC. OOSV | ACC. OOUV
CHARWNN | –         | 97.32 | 89.86     | 85.48
WNN     | CAPS+SUF2 | 97.21 | 89.28     | 86.89
WNN     | CAPS      | 97.08 | 86.08     | 79.96
WNN     | SUF2      | 96.33 | 84.16     | 80.61
WNN     | –         | 96.13 | 80.68     | 71.94

Comparison of different NNs for POS tagging of the WSJ Corpus

Portuguese POS Tagging Results:

SYSTEM  | FEATURES  | ACC.  | ACC. OOSV | ACC. OOUV
CHARWNN | –         | 97.47 | 92.49     | 89.74
WNN     | CAPS+SUF3 | 97.42 | 92.64     | 89.64
WNN     | CAPS      | 97.27 | 90.41     | 86.35
WNN     | SUF3      | 96.35 | 85.73     | 81.67
WNN     | –         | 96.19 | 83.08     | 75.40

Comparison of different NNs for POS tagging of the Mac-Morpho Corpus

Results:

● Most similar words using character-level embeddings learned with the WSJ Corpus (first row: query words; below each: its nearest neighbors)

INCONSIDERABLE    | 83-YEAR-OLD | SHEEP-LIKE    | DOMESTICALLY | UNSTEADINESS
INCONCEIVABLE     | 43-YEAR-OLD | ROCKET-LIKE   | FINANCIALLY  | UNEASINESS
INDISTINGUISHABLE | 63-YEAR-OLD | FERN-LIKE     | ESSENTIALLY  | UNHAPPINESS
INNUMERABLE       | 73-YEAR-OLD | SLIVER-LIKE   | GENERALLY    | UNPLEASANTNESS
INCOMPATIBLE      | 49-YEAR-OLD | BUSINESS-LIKE | IRONICALLY   | BUSINESS
INCOMPREHENSIBLE  | 53-YEAR-OLD | WAR-LIKE      | SPECIALLY    | UNWILLINGNESS

Results:

● Most similar words using word-level embeddings learned from unlabeled English texts (first row: query words; below each: its nearest neighbors)

INCONSIDERABLE | 00-YEAR-OLD        | SHEEP-LIKE      | DOMESTICALLY | UNSTEADINESS
INSIGNIFICANT  | SEVENTEEN-YEAR-OLD | BURROWER        | WORLDWIDE    | PARESTHESIA
INORDINATE     | SIXTEEN-YEAR-OLD   | CRUSTACEAN-LIKE | 000,000,000  | HYPERSALIVATION
ASSUREDLY      | FOURTEEN-YEAR-OLD  | TROLL-LIKE      | 00,000,000   | DROWSINESS
UNDESERVED     | NINETEEN-YEAR-OLD  | SCORPION-LIKE   | SALES        | DIPLOPIA
SCRUPLE        | FIFTEEN-YEAR-OLD   | UROHIDROSIS     | RETAILS      | BREATHLESSNESS

Results:

● Most similar words using word-level embeddings learned from unlabeled Portuguese texts (first row: query words; below each: its nearest neighbors)

GRADAÇÕES       | CLANDESTINAMENTE | REVOGAÇÃO             | DESLUMBRAMENTO | DROGASSE
TONALIDADES     | ILEGALMENTE      | ANULAÇÃO              | ASSOMBRO       | –
MODULAÇÕES      | ALI              | PROMULGAÇÃO           | EXOTISMO       | –
CARACTERIZAÇÕES | ATAMBUA          | CADUCIDADE            | ENFADO         | –
NUANÇAS         | BRAZZAVILLE      | INCONSTITUCIONALIDADE | ENCANTAMENTO   | –
COLORAÇÕES      | VOLUNTARIAMENTE  | NULIDADE              | FASCÍNIO       | –

Future Work :

● Analyzing the interrelationship between the two embeddings in more detail

● Applying this approach to other NLP tasks, such as text chunking and NER

Thank You