CSE 291G : Deep Learning for Sequences
Paper presentation
Topic: Named Entity Recognition
Rithesh
Outline
• Named Entity Recognition and its applications.
• Existing methods
• Character level feature extraction
• RNN: BLSTM-CNNs
Named Entity Recognition (NER)
WHAT ?
Named Entity Recognition (entity identification, entity chunking & entity extraction)
• Locate and classify named entity mentions in unstructured text into predefined categories: person names, organizations, locations, time expressions, etc.
• Ex: Kim bought 500 shares of IBM in 2010. (Kim → person name; IBM → organization; 2010 → time)
• Importance of locating the named entity in a sentence. Ex: Kim bought 500 shares of Bank of America in 2010. (the whole span "Bank of America" is one organization, not the location "America")
Named Entity Recognition (NER)
WHAT ?
WHY ?
Applications of NER
• Content Recommendations
• Customer support
• Classifying content for news providers
• Efficient search algorithms
• Question answering (QA)
• Machine translation systems
• Automatic summarization systems
Named Entity Recognition (NER)
WHAT ?
WHY ?
HOW ?
Approaches:
• ML classification techniques (e.g., SVM, perceptron model, CRF (Conditional Random Fields))
  Drawback: requires hand-crafted features
• Neural network model (Collobert et al. – Natural Language Processing (almost) from Scratch)
  Drawbacks: (i) a simple feedforward NN with a fixed window size; (ii) depends solely on word embeddings and fails to exploit character-level features (prefix, suffix, etc.)
• RNN: LSTM – variable-length input and long-term memory
  – First applied to NER by Hammerton in 2003
RNN: LSTM
• Overcomes drawbacks of the existing systems
• Accounts for variable-length input and long-term memory
• Fails to handle cases in which the i-th word of a sentence S depends on words at positions greater than i in S. Ex: "Teddy bears are on sale." vs. "Teddy Roosevelt was a great president." (the tag for "Teddy" depends on the words that follow it)
SOLUTION: Bi-directional LSTM (BLSTM) – captures information from the past and from the future (see the sketch below).
Still fails to exploit character-level features.
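As a minimal PyTorch sketch of the idea (the layer sizes here are illustrative, not the paper's configuration): the forward pass reads words 1..i and the backward pass reads words N..i, so each word's representation depends on both past and future context.

```python
import torch
import torch.nn as nn

# Minimal BLSTM tagger sketch (hypothetical sizes).
emb_dim, hidden_dim, num_tags = 50, 100, 9

blstm = nn.LSTM(input_size=emb_dim, hidden_size=hidden_dim,
                bidirectional=True, batch_first=True)
to_tags = nn.Linear(2 * hidden_dim, num_tags)  # both directions concatenated

sentence = torch.randn(1, 7, emb_dim)  # a batch of one 7-word sentence
outputs, _ = blstm(sentence)           # shape (1, 7, 2 * hidden_dim)
scores = to_tags(outputs)              # a score for each tag, at each word
```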
Techniques to capture character-level features
• Santos and Labeau (2015) proposed models for character-level feature extraction using CNNs, for NER and POS tagging.
• Ling (2015) proposed a model for character-level feature extraction using a BLSTM, for POS tagging.
• CNN or BLSTM?
  – BLSTM did not perform significantly better than CNN, and BLSTM is computationally more expensive to train.
⇒ BLSTM: word-level feature extraction; CNN: character-level feature extraction
Named Entity Recognition with Bidirectional LSTM-CNNs
Jason P. C. Chiu and Eric Nichols (2016). Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics, 4:357-370.
• Inspired by:
  – Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011b. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493-2537.
  – Cícero dos Santos and Victor Guimarães. 2015. Boosting named entity recognition with neural character embeddings. Proceedings of the Fifth Named Entities Workshop, pages 25-33.
Reference paper: Boosting NER with Neural Character Embeddings
• CharWNN deep neural network – uses word-level and character-level representations (embeddings) to perform sequential classification.
• Evaluated on two corpora: HAREM I (Portuguese) and SPA CoNLL-2002 (Spanish).
• CharWNN extends Collobert et al.'s (2011) neural network architecture for sequential classification by adding a convolutional layer to extract character-level representations.
CharWNN
• Input: a sentence S
• Output: for each word in the sentence, a score for each class
S = ⟨w₁, w₂, …, w_N⟩
Each word w_n is mapped to a vector u_n = [r^wrd; r^wch], the concatenation of its word embedding r^wrd and its character-level embedding r^wch (a minimal sketch follows).
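A sketch of this concatenation (the dimensions are illustrative, not fixed by the slide):

```python
import torch

r_wrd = torch.randn(50)  # word-level embedding of w_n (e.g., 50-dim)
r_wch = torch.randn(25)  # character-level embedding of w_n from the CNN
u_n = torch.cat([r_wrd, r_wch])  # u_n = [r^wrd; r^wch], a 75-dim vector
```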
CNN for character embedding
W = ⟨c₁, c₂, …, c_M⟩ (the word as a sequence of M characters)
• Each character is mapped to a character embedding.
• A convolution (matrix-vector operation) with window size k is applied over the character embeddings.
• Max pooling over the word's character windows yields the fixed-size character-level representation r^wch (a sketch follows below).
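A minimal sketch of the character-level convolution with max-over-time pooling, in the spirit of CharWNN (all sizes here are illustrative assumptions):

```python
import torch
import torch.nn as nn

char_vocab, char_emb_dim, k, num_filters = 70, 10, 5, 25  # illustrative sizes

char_table = nn.Embedding(char_vocab, char_emb_dim)
# Convolution over the character sequence with window size k
conv = nn.Conv1d(char_emb_dim, num_filters, kernel_size=k, padding=k // 2)

chars = torch.randint(0, char_vocab, (1, 8))  # one word of 8 characters
e = char_table(chars).transpose(1, 2)         # (1, char_emb_dim, 8)
h = conv(e)                                   # (1, num_filters, 8)
r_wch = h.max(dim=2).values.squeeze(0)        # max over positions -> (25,)
```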
CharWNN (contd.)
With r^wch computed for every word, the sentence S = ⟨w₁, …, w_N⟩ becomes the sequence of joint embeddings ⟨u₁, u₂, …, u_N⟩.
CharWNN
• Input to the word-level convolution layer: ⟨u₁, u₂, …, u_N⟩
• Two neural network layers then produce, for each word position n, a score s(u_n)_t for every tag t.
• For a transition score matrix A, the entry A_{t,u} gives the score for jumping from tag t to tag u; the score of a sentence with tag path t₁ … t_N is

  $S = \sum_{n=1}^{N} \big( A_{t_{n-1}, t_n} + s(u_n)_{t_n} \big)$

(a code sketch of this path score follows below).
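A sketch of the path score under assumed shapes: `net_scores[n, t]` is the network's score for tag t at word n, and `A[t, u]` is the learned transition score (the start transition is omitted here for simplicity):

```python
import torch

def path_score(net_scores, A, tags):
    """Sum of per-word network scores and tag-transition scores along one path.

    net_scores: (N, T) tensor, net_scores[n, t] = score of tag t at word n.
    A: (T, T) tensor, A[t, u] = score for jumping from tag t to tag u.
    tags: list of N tag indices (one candidate path).
    """
    score = net_scores[0, tags[0]]
    for n in range(1, len(tags)):
        score = score + A[tags[n - 1], tags[n]] + net_scores[n, tags[n]]
    return score
```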
Network Training for CharWNN
• CharWNN is trained by minimizing the negative log-likelihood over the training set D.
• The sentence score is interpreted as a conditional probability over a tag path: the score is exponentiated and normalized with respect to all possible paths.
• Stochastic gradient descent (SGD) is used to minimize the negative log-likelihood with respect to the model parameters θ.
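Concretely, writing S(w, t) for the sentence score of tag path t (as on the previous slide), the normalization the slide describes gives the standard sentence-level log-likelihood of Collobert et al.; the log-sum-exp over all paths looks exponential but is computed in O(N·|T|²) by a forward dynamic-programming recursion:

```latex
\log p(t \mid w, \theta) \;=\; S(w, t) \;-\; \log \sum_{t'} e^{S(w, t')}
```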
Embeddings
• Word-level embeddings: for Portuguese NER, the word-level embeddings previously trained by Santos (2014) were used; for Spanish, embeddings were trained on the Spanish Wikipedia.
• Character-level embeddings: unsupervised learning of character-level embeddings was NOT performed. The character-level embeddings are initialized by randomly sampling each value from a uniform distribution.
Corpus: Portuguese & Spanish
Hyperparameters
Comparison of different NNs for the SPA CoNLL-2002 corpus
Comparison with the state-of-the-art for the SPA CoNLL-2002 corpus
Comparison of different NNs for the HAREM I corpus
Comparison with the state-of-the-art for the HAREM I corpus
Chiu, J. P., & Nichols, E. (2016). Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics, 4:357-370.
BLSTM: word-level feature extraction; CNN: character-level feature extraction
Character Level feature extraction
Word level feature extraction
Embeddings
• Word embeddings: the 50-dimensional word embeddings released by Collobert (2011b), trained on Wikipedia and the Reuters RCV-1 corpus. Stanford's GloVe and Google's word2vec embeddings were also evaluated.
• Character embeddings: a randomly initialized lookup table with values drawn from a uniform distribution with range [−0.5, 0.5], producing character embeddings of 25 dimensions (see the sketch below).
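A sketch of that initialization (the 25 dimensions and the [−0.5, 0.5] range are from the slide; the vocabulary size is a made-up placeholder):

```python
import torch.nn as nn

char_vocab_size = 100  # hypothetical vocabulary size
char_emb = nn.Embedding(char_vocab_size, 25)  # 25-dim, as in the paper
nn.init.uniform_(char_emb.weight, -0.5, 0.5)  # U(-0.5, 0.5) initialization
```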
Additional Features
• Additional word-level features:
  – Capitalization feature: allCaps, upperInitial, lowercase, mixedCaps, noinfo (one possible implementation is sketched below).
  – Lexicons: SENNA and DBpedia
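One plausible way to compute the capitalization feature named above; the exact rules in the paper may differ, and the treatment of non-alphabetic tokens here is a guess:

```python
def capitalization_feature(word: str) -> str:
    """Map a token to one of the five capitalization classes from the slide."""
    if not word.isalpha():
        return "noinfo"  # assumption: non-alphabetic tokens carry no info
    if word.isupper():
        return "allCaps"
    if word.islower():
        return "lowercase"
    if word[0].isupper() and word[1:].islower():
        return "upperInitial"
    return "mixedCaps"
```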
Training and Inference
• Implementation:
  – torch7 library
  – Initial state of the LSTM set to zero vectors.
• Objective: maximize the sentence-level log-likelihood.
  – The objective function and its gradient can be efficiently computed by dynamic programming.
  – The Viterbi algorithm is used to find the optimal tag sequence [i]₁ᵀ that maximizes the sentence score (a sketch follows below).
• Learning: training was done by mini-batch stochastic gradient descent (SGD) with a fixed learning rate; each mini-batch consists of multiple sentences with the same number of tokens.
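A compact NumPy sketch of Viterbi decoding over the per-word tag scores and the transition matrix (shapes as in the earlier path-score sketch; this is illustrative, not the torch7 implementation):

```python
import numpy as np

def viterbi(net_scores, A):
    """Best tag path by dynamic programming.

    net_scores: (N, T) array of per-word tag scores from the network.
    A: (T, T) array of tag-transition scores.
    """
    N, T = net_scores.shape
    delta = net_scores[0].copy()        # best score of any path ending in tag t
    back = np.zeros((N, T), dtype=int)  # backpointers
    for n in range(1, N):
        cand = delta[:, None] + A       # cand[p, c] = best-to-p + transition p->c
        back[n] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + net_scores[n]
    tags = [int(delta.argmax())]        # backtrack from the best final tag
    for n in range(N - 1, 0, -1):
        tags.append(int(back[n, tags[-1]]))
    return tags[::-1]
```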
Results
Results: F1 scores of BLSTM and BLSTM-CNN with various additional features
(emb: Collobert word embeddings, char: character-type feature, caps: capitalization feature, lex: lexicon feature)
Results: Word embeddings
Results: Various dropout values
Questions to discuss
• Why is BLSTM-CNN the best choice?
• Is the proposed model language-independent?
• Is it a good idea to use additional features (capitalization, prefix, suffix, etc.)?
• Possible future work…
Thank you!!