Neural Networks - Cornell University · 2017-02-13 · Slides adapted from Dan Klein, Dan Jurafsky,...


Neural Networks

Instructor: Yoav Artzi

CS5740: Natural Language ProcessingSpring 2017

Slides adapted from Dan Klein, Dan Jurafsky, Chris Manning, Michael Collins, Luke Zettlemoyer, Yejin Choi, and Slav Petrov

Overview
• Introduction to Neural Networks
• Word representations
• NN optimization tricks

Some History
• Neural network algorithms date from the 80's
  – Originally inspired by early neuroscience
• Historically slow, complex, and unwieldy
• Now: the term is abstract enough to encompass almost any model – but useful!
• Dramatic shift in the last 2-3 years away from MaxEnt (linear, convex) to "neural net" (non-linear) architectures

The "Promise"
• Most ML works well because of human-designed representations and input features
• ML becomes just optimizing weights
• Representation learning attempts to automatically learn good features and representations
• Deep learning attempts to learn multiple levels of representation of increasing complexity/abstraction

Named Entity Recognition

WordNet

Semantic Role Labeling

Neuron
• Neural networks come with their own terminological baggage
• Parameters:
  – Weights: w_i and b
  – Activation function
• If we drop the activation function, does this remind you of something?

Biological “Inspiration”

Neural Network

Neural Network

Matrix Notation
o = W''(W' a + b') + b''

h1 = W'_11 a1 + W'_12 a2 + b'_1
h2 = W'_21 a1 + W'_22 a2 + b'_2

o1 = W''_11 h1 + W''_12 h2 + b''_1
o2 = W''_21 h1 + W''_22 h2 + b''_2
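A minimal NumPy sketch of the same two-layer computation; the variable names (W1 for W', W2 for W'') and the toy values are illustrative, not from the slides:

```python
import numpy as np

a = np.array([0.5, -1.0])              # input vector (a1, a2)

W1 = np.array([[0.1, 0.2],             # W': maps input -> hidden
               [0.3, 0.4]])
b1 = np.array([0.01, 0.02])            # b'

W2 = np.array([[0.5, 0.6],             # W'': maps hidden -> output
               [0.7, 0.8]])
b2 = np.array([0.03, 0.04])            # b''

h = W1 @ a + b1                        # h = W'a + b'  (no activation, as on the slide)
o = W2 @ h + b2                        # o = W''(W'a + b') + b''
print(h, o)
```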

Neuron and Other Models
• A single neuron is a perceptron
• Strong connection to MaxEnt – how?

From MaxEnt to Neural Nets
• Vector form MaxEnt:

  P(y | x; w) = e^{w^T φ(x,y)} / Σ_{y'} e^{w^T φ(x,y')}

• For two classes:

  P(y1 | x; w) = e^{w^T φ(x,y1)} / (e^{w^T φ(x,y1)} + e^{w^T φ(x,y2)})

               = [e^{w^T φ(x,y1)} / (e^{w^T φ(x,y1)} + e^{w^T φ(x,y2)})] · [e^{-w^T φ(x,y1)} / e^{-w^T φ(x,y1)}]

               = 1 / (1 + e^{w^T (φ(x,y2) - φ(x,y1))})

               = 1 / (1 + e^{-w^T z})

               = f(w^T z)

  where z = φ(x,y1) - φ(x,y2) and f is the logistic function (sigmoid).

From MaxEnt to Neural Nets
• Vector form MaxEnt:

  P(y | x; w) = e^{w^T φ(x,y)} / Σ_{y'} e^{w^T φ(x,y')}

• For two classes:

  P(y1 | x; w) = 1 / (1 + e^{-w^T z}) = f(w^T z)

• Neuron:
  – Add an "always on" feature for the class prior → bias term (b)

  h_{w,b}(z) = f(w^T z + b)        f(u) = 1 / (1 + e^{-u})

From MaxEnt to Neural Nets
• Vector form MaxEnt:

  P(y | x; w) = e^{w^T φ(x,y)} / Σ_{y'} e^{w^T φ(x,y')}

• Neuron:

  h_{w,b}(z) = f(w^T z + b)        f(u) = 1 / (1 + e^{-u})

• Neuron parameters: w, b
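A short sketch of the single logistic neuron defined above, h_{w,b}(z) = f(w^T z + b); the weight, bias, and input values here are made up for illustration:

```python
import numpy as np

def sigmoid(u):
    # f(u) = 1 / (1 + e^{-u})
    return 1.0 / (1.0 + np.exp(-u))

def neuron(z, w, b):
    # h_{w,b}(z) = f(w^T z + b)
    return sigmoid(np.dot(w, z) + b)

w = np.array([1.5, -2.0, 0.5])     # weights
b = 0.1                            # bias ("always on" feature)
z = np.array([0.2, 0.4, -0.3])     # input / feature vector
print(neuron(z, w, b))             # output in (0, 1), like a two-class MaxEnt probability
```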

Neural Net = Several MaxEnt Models
• Feed a number of MaxEnt models → vector of outputs
• And repeat …

Neural Net = Several MaxEnt Models
• But: how do we tell the hidden layer what to do?
  – Learning will figure it out

How to Train?
• No hidden layer:
  – Supervised
  – Just like MaxEnt
• With hidden layers:
  – Latent units → not convex
  – What do we do?
    • Back-propagate the gradient (see the sketch below)
    • About the same, but no guarantees
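A minimal sketch of what back-propagation looks like for one hidden layer, assuming sigmoid units, a squared-error loss, and toy data; the network size, learning rate, and data are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # 100 examples, 3 features
y = (X[:, 0] * X[:, 1] > 0).astype(float)     # toy non-linear target

W1, b1 = rng.normal(scale=0.1, size=(4, 3)), np.zeros(4)   # input -> hidden
W2, b2 = rng.normal(scale=0.1, size=4), 0.0                # hidden -> output
lr = 0.5

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

for epoch in range(200):
    # forward pass
    h = sigmoid(X @ W1.T + b1)                # hidden activations
    p = sigmoid(h @ W2 + b2)                  # output "probability"
    # backward pass: gradient of mean squared error w.r.t. pre-activations
    d_out = (p - y) * p * (1 - p)
    d_hid = np.outer(d_out, W2) * h * (1 - h) # back-propagate through the hidden layer
    # gradient descent updates
    W2 -= lr * d_out @ h / len(X)
    b2 -= lr * d_out.mean()
    W1 -= lr * d_hid.T @ X / len(X)
    b1 -= lr * d_hid.mean(axis=0)
    # (optionally track the loss: 0.5 * np.mean((p - y) ** 2))
```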

Probabilistic Output from Neural Nets

• What if we want the output to be a probability distribution over possible outputs?

• Normalize the output activations using softmax:

– Where 𝑞 is the output layer

y = softmax(W · z + b)

softmax(q)_i = e^{q_i} / Σ_{j=1}^{k} e^{q_j}
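A small sketch of the softmax normalization; subtracting the maximum is a common numerical-stability trick, not something the slides mention:

```python
import numpy as np

def softmax(q):
    q = q - np.max(q)          # shift so exp() cannot overflow; result is unchanged
    e = np.exp(q)
    return e / e.sum()

q = np.array([2.0, 1.0, 0.1])  # output-layer activations
print(softmax(q))              # sums to 1, roughly [0.659, 0.242, 0.099]
```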

Word Representations
• So far, atomic symbols:
  – "hotel", "conference", "walking", "___ing"
• But neural networks take vector input
• How can we bridge the gap?
• One-hot vectors
  – Dimensionality: size of the vocabulary
    • 20K for speech
    • 500K for broad-coverage domains
    • 13M for Google corpora

hotel      = [0 0 0 0 … 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]
conference = [0 0 0 0 … 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]

Word Representations
• One-hot vectors:
  – Problems?
  – Information sharing?
    • "hotel" vs. "hotels" (see the sketch below)

hotel      = [0 0 0 0 … 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]
conference = [0 0 0 0 … 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]
hotels     = [0 0 0 0 … 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
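A quick sketch of one-hot encoding over a toy vocabulary, illustrating the lack of information sharing between "hotel" and "hotels"; the vocabulary and indexing scheme are made up:

```python
import numpy as np

vocab = ["hotel", "hotels", "conference", "walking"]
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[word_to_id[word]] = 1.0
    return v

print(one_hot("hotel"))       # [1. 0. 0. 0.]
print(one_hot("conference"))  # [0. 0. 1. 0.]
# one_hot("hotel") and one_hot("hotels") are orthogonal -- nothing is shared.
```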

Word Embeddings
• Each word is represented using a dense, low-dimensional vector
  – Low-dimensional << vocabulary size
• If trained well, similar words will have similar vectors
• How to train? What objective to maximize?
  – Soon …

Word Embeddings as Features
• Example: sentiment classification
  – very positive, positive, neutral, negative, very negative
• Feature-based models: bag of words
• Any good neural net architecture?
  – Concatenate all the vectors
    • Problem: different documents → different lengths
  – Instead: sum, average, etc. (see the sketch below)
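A minimal sketch of the averaging idea, assuming a toy embedding table; the dimensionality (d = 4) and vocabulary are illustrative, and in practice the embeddings would be trained:

```python
import numpy as np

d = 4
vocab = ["the", "hotel", "was", "great", "terrible"]
E = {w: np.random.randn(d) for w in vocab}   # pretend these are trained embeddings

def doc_vector(tokens):
    # Average the word vectors -> fixed-size input regardless of document length.
    return np.mean([E[t] for t in tokens], axis=0)

x = doc_vector(["the", "hotel", "was", "great"])
print(x.shape)   # (4,) -- can now be fed to a classifier or further layers
```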

Neural Bag-of-words

[Iyyer et al. 2015; Wang and Manning 2012]

Deep Averaging Networks

IMDB sentiment analysis:

  BOW + fancy smoothing + SVM    88.23
  NBOW + DAN                     89.4

Practical Tips
• Select a network structure appropriate for the problem
  – Window vs. recurrent vs. recursive
  – Non-linearity function
• Gradient checks to identify bugs (see the sketch below)
• Parameter initialization
• Is the model powerful enough?
  – If not, make it larger
  – If yes, regularize, otherwise it will overfit
• Know your non-linearity function and its gradient
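A sketch of a finite-difference gradient check, one way to implement the "gradient checks" tip above; the loss function here is just an example whose analytic gradient is known:

```python
import numpy as np

def numerical_grad(loss, w, eps=1e-5):
    # Central differences: (loss(w + eps) - loss(w - eps)) / (2 * eps), per coordinate.
    g = np.zeros_like(w)
    for i in range(len(w)):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        g[i] = (loss(w_plus) - loss(w_minus)) / (2 * eps)
    return g

# Example: loss(w) = 0.5 * ||w||^2, whose analytic gradient is w itself.
loss = lambda w: 0.5 * np.sum(w ** 2)
w = np.array([0.3, -1.2, 2.0])
print(np.max(np.abs(w - numerical_grad(loss, w))))  # should be tiny
```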

Avoiding Overfitting
• Reduce model size (but not too much)
• L1 and L2 regularization
• Early stopping (e.g., patience)
• Dropout (Hinton et al. 2012)
  – Randomly set 50% of the inputs in each layer to 0 (see the sketch below)
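A minimal sketch of dropout on a layer's inputs, using the inverted-dropout rescaling convention (a common variant; the original formulation instead scales the weights at test time). The keep probability of 0.5 matches the "set 50% to 0" description above:

```python
import numpy as np

def dropout(x, p_keep=0.5, train=True):
    if not train:
        return x                                  # no dropout at test time
    mask = (np.random.rand(*x.shape) < p_keep)    # keep each unit with probability p_keep
    return x * mask / p_keep                      # rescale so expected values match

h = np.array([0.2, -1.3, 0.7, 0.05, 2.1])
print(dropout(h))        # roughly half the entries zeroed, the rest scaled by 2
```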