Edinburgh MT lecture2: Probability and Language Models
•Homework 1 posted, due January 28
•Recommended work plan:
•Complete “Getting Started” by TOMORROW
•Complete “Baseline” by next Friday
•Complete “The Challenge” by January 28
Learn: write a function

def learn(parallel_data):
    # do something
    return parameters

Translate: write a function

def translate(French, parameters):
    # do something
    return English
Using probability:
T : Σ_f* × Θ → Σ_e*
L : (Σ_f* × Σ_e*)* → Θ
T(f, θ) = argmax_{e ∈ Σ_e*} p_θ(e | f)
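The decision rule T(f, θ) = argmax p_θ(e | f) can be sketched in a few lines of Python; the candidate set and scoring function here are hypothetical stand-ins for a real model:

```python
def translate(f, candidates, score):
    # T(f, θ): return the English sentence e maximizing p_θ(e | f).
    return max(candidates, key=lambda e: score(e, f))

# Toy example with made-up candidates and scores (not a real model).
scores = {("the", "house"): 0.7, ("a", "house"): 0.3}
best = translate(("la", "maison"), list(scores),
                 lambda e, f: scores[e])   # → ("the", "house")
```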
Why probability?
•Formalizes...
•the concept of models
•the concept of data
•the concept of learning
•the concept of inference (prediction)
•Derive logical conclusions in the face of ambiguity.
Basic Concepts
•Sample space S: set of all possible outcomes.
•Event space E: any subset of the sample space.
•Random variable: function from S to a set of disjoint events in S.
•Probability measure P: a function from events to non-negative real numbers satisfying these axioms:
1. P(E) ≥ 0 for every event E
2. P(S) = 1
3. For all E_1, ..., E_k with ⋂_{i=1}^k E_i = ∅: P(E_1 ∪ ... ∪ E_k) = Σ_{i=1}^k P(E_i)
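As a sanity check, the three axioms can be verified for a fair die in a few lines of Python (a toy distribution of my own, not from the slides):

```python
from fractions import Fraction

# Fair die: S = {1, ..., 6}, each outcome with probability 1/6.
P = {s: Fraction(1, 6) for s in range(1, 7)}

def prob(event):
    # P(E) as a sum over outcomes (events here are subsets of S).
    return sum(P[s] for s in event)

assert all(p >= 0 for p in P.values())               # axiom 1
assert prob(range(1, 7)) == 1                        # axiom 2
assert prob({1, 2, 5}) == prob({1, 2}) + prob({5})   # axiom 3, disjoint events
```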
Two dice: sample space S = {1, 2, 3, 4, 5, 6}², with r.v.s X(x, y) = x and Y(x, y) = y.
[Table: 6×6 grid of outcomes, each cell with probability 1/36]
p(X = 1, Y = 1) = 1/36
A probability over multiple events is a joint probability.
p(1, 1) = 1/36 (shorthand for the same joint probability)
p(Y = 1) = Σ_{x ∈ X} p(X = x, Y = 1) = 1/6   (by axiom 3)
A probability distribution over a subset of variables is a marginal probability.
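The marginalization step can be reproduced directly from the two-dice joint distribution (a quick numeric check, assuming fair dice as above):

```python
from fractions import Fraction
from itertools import product

# Joint distribution for two fair dice: every pair has probability 1/36.
joint = {(x, y): Fraction(1, 36) for x, y in product(range(1, 7), repeat=2)}

# Marginal via axiom 3: p(Y = 1) = sum over x of p(X = x, Y = 1).
p_Y1 = sum(joint[(x, 1)] for x in range(1, 7))
assert p_Y1 == Fraction(1, 6)
```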
p(X = 1) = Σ_{y ∈ Y} p(X = 1, Y = y) = 1/6
P(Y = 1 | X = 1) = P(X = 1, Y = 1) / Σ_{y ∈ Y} P(X = 1, Y = y) = 1/6   (numerator: joint; denominator: marginal)
The probability of a r.v. when the values of the other r.v.s are known is its conditional probability.
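The same ratio of joint to marginal can be computed on the dice example (my own check, using the fair-dice joint from before):

```python
from fractions import Fraction
from itertools import product

joint = {(x, y): Fraction(1, 36) for x, y in product(range(1, 7), repeat=2)}

# p(Y = 1 | X = 1) = joint / marginal of the conditioning variable.
p_cond = joint[(1, 1)] / sum(joint[(1, y)] for y in range(1, 7))
assert p_cond == Fraction(1, 6)
```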
A variable is conditionally independent of another iff its marginal probability = its conditional probability.
In other words, if knowing X tells me nothing about Y.
P(Y = 1 | X = 1) = P(Y = 1) = 1/6
P(X = x) = 1/6 for every x, and P(Y = y) = 1/6 for every y
P(X, Y) = P(X)P(Y)
Far fewer parameters!
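The parameter savings are easy to see in code: under independence the joint factorizes, so a sketch (again with fair dice) stores 6 + 6 numbers instead of 36:

```python
from fractions import Fraction
from itertools import product

# Two independent marginals suffice to reconstruct the whole joint.
pX = {x: Fraction(1, 6) for x in range(1, 7)}
pY = {y: Fraction(1, 6) for y in range(1, 7)}
joint = {(x, y): pX[x] * pY[y] for x, y in product(pX, pY)}

assert joint[(1, 1)] == Fraction(1, 36)
assert len(pX) + len(pY) == 12 and len(joint) == 36
```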
Joint distribution of weather and temperature:
          20°C   15°C   10°C   5°C    0°C    -5°C
snow      0      0      0      .003   .01    .03
no snow   .2     .25    .2     .147   .09    .07
p(snow | -5°C) = .03 / (.03 + .07) = .30
In most interesting models, variables are not conditionally independent.
Under this distribution, temperature and weather r.v.’s are not conditionally independent!
p(snow) = 0 + 0 + 0 + .003 + .01 + .03 = .043, while p(snow | -5°C) = .30.
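Both quantities can be recomputed from the table; note the "no snow" label for the second row is my inference from the fact that the two rows sum to one:

```python
temps = [20, 15, 10, 5, 0, -5]
# Joint probabilities read off the table (second-row label assumed).
p_snow    = dict(zip(temps, [0, 0, 0, .003, .01, .03]))
p_no_snow = dict(zip(temps, [.2, .25, .2, .147, .09, .07]))

marginal = sum(p_snow.values())                       # p(snow) ≈ .043
cond = p_snow[-5] / (p_snow[-5] + p_no_snow[-5])      # p(snow | -5°C) = .30
```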
Bayes' Rule
p(English | Chinese) = p(English) × p(Chinese | English) / p(Chinese)
(prior × likelihood / evidence)
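Bayes' rule can be exercised on the weather numbers from the earlier slides; this worked inversion of the snow conditional is my own example:

```python
prior      = 0.03 + 0.07   # p(-5°C), the -5°C column summed over weather
likelihood = 0.30          # p(snow | -5°C), from the table
evidence   = 0.043         # p(snow), the marginal computed earlier

# Bayes' rule: p(-5°C | snow) = prior × likelihood / evidence ≈ 0.698
posterior = prior * likelihood / evidence
```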
Noisy Channel
p(English | Chinese) = p(English) × p(Chinese | English) / p(Chinese)
(signal model × channel model / normalization, which ensures we're working with valid probabilities)
Machine Translation
p(English | Chinese) = p(English) × p(Chinese | English) / p(Chinese)
(language model × translation model / normalization)
Machine Translation
p(English | Chinese) ∝ p(English) × p(Chinese | English)
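Dropping the evidence term never changes the argmax, since p(Chinese) is the same for every English candidate; a toy check with hypothetical numbers:

```python
# Hypothetical model scores for two candidate English outputs.
p_e  = {"hello": 0.6, "hi": 0.4}            # language model p(English)
p_fe = {"hello": 0.2, "hi": 0.25}           # translation model p(Chinese|English)
p_f  = sum(p_e[e] * p_fe[e] for e in p_e)   # evidence p(Chinese)

with_norm    = max(p_e, key=lambda e: p_e[e] * p_fe[e] / p_f)
without_norm = max(p_e, key=lambda e: p_e[e] * p_fe[e])
assert with_norm == without_norm            # same argmax either way
```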
Questions we must answer:
•How do we define the probability of a sentence?
•How do we define the probability of a Chinese sentence, given a particular English sentence?
•What is the sample space?
n-gram Language Models
S = V*, where V = the set of all English words
Define an infinite set of events: Xi(s) = ith word in s if len(s) ≥ i, ε otherwise.
Must define: p(X0...X∞)
by chain rule: p(X0...X∞) = p(X0) p(X1|X0) ⋯ p(Xk|X0...Xk−1) ⋯
assume conditional independence: p(X0...X∞) = p(X0) p(X1|X0) ⋯ p(Xk|Xk−1) ⋯
Key idea: since the language model is a joint model over all words in a sentence, make each word depend on n-1 previous
words in the sentence.
n-gram Language Models
p(English) = ∏_{i=1}^{length(English)} p(word_i | word_{i−1})
Language Models
Note: the probability that word0=START is 1.
This model explains every word in the English sentence.
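The model above can be sketched as a count-based bigram language model; the toy corpus and relative-frequency estimates are my own illustration (real systems also smooth these counts):

```python
from collections import defaultdict

# Toy parallel-free training data; START makes p(word_0 = START) = 1.
corpus = [["START", "the", "dog", "barks"],
          ["START", "the", "cat", "sleeps"]]

bigram, context = defaultdict(int), defaultdict(int)
for sent in corpus:
    for prev, word in zip(sent, sent[1:]):
        bigram[(prev, word)] += 1   # count(word_{i-1}, word_i)
        context[prev] += 1          # count(word_{i-1}) as history

def p_sentence(words):
    # p(English) = product over i of p(word_i | word_{i-1}).
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= bigram[(prev, word)] / context[prev]
    return p

# p(START the dog barks) = (2/2) * (1/2) * (1/1) = 0.5
```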