Lecture 25: CS573 Advanced Artificial Intelligence Milind Tambe Computer Science Dept and...
Date posted: 19-Dec-2015
Lecture 25: CS573
Advanced Artificial Intelligence
Milind Tambe
Computer Science Dept and Information Science Inst
University of Southern California
Surprise Quiz II: Part I
[Figure: belief network with node A as the parent of nodes B and C]
P(A) = 0.05

A    P(B)
T    0.9
F    0.05

A    P(C)
T    0.7
F    0.01

Questions: Surprise
Markov
Dynamic Belief Nets
[Figure: three time slices with state nodes Xt, Xt+1, Xt+2 and evidence nodes Et, Et+1, Et+2]
In each time slice:
• Xt = Unobservable state variables
• Et = Observable evidence variables
Types of Inference
• Filtering or monitoring: P(Xt | e1, e2…et)
– Keep track of probability distribution over current states
– Like POMDP belief state
– P(@ISI | c1,c2…ct) and P(N@ISI | c1,c2…ct)
• Prediction: P(Xt+k | e1,e2…et) for some k > 0
– P(@ISI 3 hours from now | c1,c2…ct)
• Smoothing or hindsight: P(Xk | e1, e2…et) for 0 <= k < t
– What is the state of the user at 11 AM, given observations at 9 AM, 10 AM, 11 AM, 1 PM, 2 PM?
• Most likely explanation: given a sequence of observations, find the sequence of states that is most likely to have generated the observations (speech recognition)
– $\arg\max_{x_{1:t}} P(x_{1:t} \mid e_{1:t})$
Filtering: P(Xt+1 | e1,e2…et+1)
$$P(X_{t+1} \mid e_{1:t+1}) = f_{1:t+1} = \mathrm{Norm} \cdot P(e_{t+1} \mid X_{t+1}) \sum_{x_t} P(X_{t+1} \mid x_t)\, P(x_t \mid e_{1:t})$$
• e1:t+1 = e1, e2…et+1
• P(xt | e1:t) = f1:t
• f1:t+1 = Norm-const * FORWARD (f1:t, et+1)
• RECURSION: each new message f1:t+1 is computed from the previous message f1:t
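The recursion can be sketched in a few lines of Python. This is a minimal illustration rather than the lecture's own code: the names `STATES`, `TRANSITION`, `SENSOR_TRUE`, and `forward` are hypothetical, the numbers are the user-location values used in the worked examples in these slides (0.7/0.3 transitions, 0.9/0.2 sensor), and the evidence is assumed to be binary so that P(c = false | L) = 1 - P(c = true | L).

```python
STATES = ("@ISI", "N@ISI")

# TRANSITION[prev][nxt] = P(L_{t+1} = nxt | L_t = prev), values from the slides
TRANSITION = {
    "@ISI": {"@ISI": 0.7, "N@ISI": 0.3},
    "N@ISI": {"@ISI": 0.3, "N@ISI": 0.7},
}

# SENSOR_TRUE[state] = P(c = true | L = state), values from the slides
SENSOR_TRUE = {"@ISI": 0.9, "N@ISI": 0.2}


def normalize(dist):
    """Scale a dict of non-negative numbers so that it sums to 1."""
    total = sum(dist.values())
    return {state: p / total for state, p in dist.items()}


def forward(f, evidence):
    """One filtering step: f_{1:t+1} = Norm * FORWARD(f_{1:t}, e_{t+1})."""
    unnormalized = {}
    for nxt in STATES:
        # Sum over x_t of P(X_{t+1} | x_t) * P(x_t | e_{1:t})
        predicted = sum(TRANSITION[prev][nxt] * f[prev] for prev in STATES)
        # Assumption: binary evidence, so P(false | state) = 1 - P(true | state)
        likelihood = SENSOR_TRUE[nxt] if evidence else 1.0 - SENSOR_TRUE[nxt]
        unnormalized[nxt] = likelihood * predicted
    return normalize(unnormalized)


f_1_1 = {"@ISI": 0.818, "N@ISI": 0.182}  # P(L1 | c1), as given in the slides
f_1_2 = forward(f_1_1, True)             # P(L2 | c1, c2)
print({state: round(p, 3) for state, p in f_1_2.items()})
# -> {'@ISI': 0.883, 'N@ISI': 0.117}
```

Calling `forward` repeatedly, feeding each result back in as `f`, carries the belief state along the evidence sequence without revisiting earlier observations.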
Computing Forward f1:t+1
• For our example of tracking user location:
• f1:t+1 = Norm-const * FORWARD (f1:t, ct+1)
• Actually it is a vector, not a single quantity
• f1:2 = P(L2 | c1, c2) implies computing both components, < P(L2 = @ISI | c1, c2), P(L2 = N@ISI | c1, c2) >, and then normalizing
Hope you tried out all the computations from the last lecture at home!
Robotic Perception
[Figure: DBN with state nodes Xt, Xt+1, Xt+2, evidence nodes Et, Et+1, Et+2, and action nodes At-1, At, At+1]
• At = action at time t (observed evidence)
• Xt = State of the environment at time t
• Et = Observation at time t (observed evidence)
Robotic Perception
• Similar to filtering task seen earlier
• Differences:
• Must take into account action evidence
$$\mathrm{Norm} \cdot P(e_{t+1} \mid X_{t+1}) \sum_{x_t} P(X_{t+1} \mid x_t, a_t)\, P(x_t \mid e_{1:t})$$
POMDP belief update?
• Must note that the variables are continuous
$$P(X_{t+1} \mid e_{1:t+1}, a_{1:t}) = \mathrm{Norm} \cdot P(e_{t+1} \mid X_{t+1}) \int P(X_{t+1} \mid x_t, a_t)\, P(x_t \mid e_{1:t}, a_{1:t-1})\, dx_t$$
Prediction
Filtering without incorporating new evidence: P(Xt+k | e1,e2…et) for some k > 0
– E.g., P(L3 | c1)
$$P(L_3 \mid c_1) = \sum_{L_2} P(L_3 \mid L_2)\, P(L_2 \mid c_1)$$
P(L3=@ISI | c1) = P(L3=@ISI|L2=@ISI)*P(L2=@ISI|c1) + P(L3=@ISI|L2=N@ISI)*P(L2=N@ISI|c1)
= 0.7 * 0.6272 + 0.3 * 0.3728
= 0.43904 + 0.11184 = 0.55
Computed in the last lecture: P(L2 | c1) = <0.6272, 0.3728>
– P(L4=@ISI | c1), summing P(L4 | L3) * P(L3 | c1) over L3:
= 0.7 * 0.55 + 0.3 * 0.45 = 0.52
Prediction
– P(L5=@ISI | c1) = 0.7 * 0.52 + 0.3 * 0.48 = 0.508
– P(L6=@ISI | c1) = 0.7 * 0.508 + 0.3 * 0.492 = 0.503 … (converging to 0.5)
• Predicted distribution of user location converges to a fixed point
– Stationary distribution of the Markov process
– Mixing time: time taken to reach the fixed point
• Prediction is useful only if k << mixing time
– The more uncertainty there is in the transition model, the shorter the mixing time and the more difficult it is to make predictions
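The convergence to the stationary distribution can be checked numerically. The sketch below is only an illustration: `predict` and `stay` are hypothetical names, and it assumes the symmetric transition model of the example (0.7 to stay in the same location, 0.3 to switch), starting from P(L2 = @ISI | c1) = 0.6272.

```python
def predict(p_isi, steps, stay=0.7):
    """Push P(L = @ISI) forward `steps` time slices without new evidence.

    Assumes the symmetric model of the example: the user keeps the same
    location with probability `stay` and switches with probability 1 - stay.
    """
    history = []
    for _ in range(steps):
        # P(@ISI next) = P(stay) * P(@ISI now) + P(switch) * P(N@ISI now)
        p_isi = stay * p_isi + (1.0 - stay) * (1.0 - p_isi)
        history.append(p_isi)
    return history


history = predict(0.6272, steps=8)  # start from P(L2 = @ISI | c1)
for slice_index, p in enumerate(history, start=3):
    print(f"P(L{slice_index} = @ISI | c1) = {p:.4f}")
```

With these numbers the distance from 0.5 shrinks by a factor of 0.7 - 0.3 = 0.4 at every step, so after a handful of slices the prediction carries almost no information about the original evidence.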
Smoothing
P(Xk | e1, e2…et) for 0 <= k < t
P(Lk | c1,c2…ct) = Norm * P(Lk | c1,c2…ck) * P(ck+1…ct | Lk)
= Norm * f1:k * bk+1:t
• bk+1:t is a backward message, like our earlier forward message
• Hence the algorithm is called the forward-backward algorithm
bk+1:t backward message
[Figure: time slices with state nodes Xk, Xk+1, Xk+2 and evidence nodes Ek, Ek+1, Ek+2]
$$\begin{aligned}
b_{k+1:t} &= P(e_{k+1:t} \mid X_k) \\
&= P(e_{k+1}, e_{k+2}, \ldots, e_t \mid X_k) \\
&= \sum_{x_{k+1}} P(e_{k+1}, e_{k+2}, \ldots, e_t \mid X_k, x_{k+1})\, P(x_{k+1} \mid X_k) \\
&= \sum_{x_{k+1}} P(e_{k+1}, e_{k+2}, \ldots, e_t \mid x_{k+1})\, P(x_{k+1} \mid X_k) \\
&= \sum_{x_{k+1}} P(e_{k+1} \mid x_{k+1})\, P(e_{k+2:t} \mid x_{k+1})\, P(x_{k+1} \mid X_k)
\end{aligned}$$
The middle factor P(ek+2:t | xk+1) is the next backward message bk+2:t, which gives the recursion:
bk+1:t = BACKWARD(bk+2:t, ek+1:t)
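Example of Smoothing
P(L1 = @ISI | c1, c2)
= Norm * P(Lk | c1,c2…ck) * P(ck+1…ct | Lk), with k = 1 and t = 2
= Norm * P(L1 = @ISI | c1) * P(c2 | L1 = @ISI) = Norm * 0.818 * P(c2 | L1 = @ISI)
P(c2 | L1 = @ISI) comes from the backward message:
$$P(e_{k+1:t} \mid X_k) = \sum_{x_{k+1}} P(e_{k+1} \mid x_{k+1})\, P(e_{k+2:t} \mid x_{k+1})\, P(x_{k+1} \mid X_k)$$
$$P(c_2 \mid L_1 = \text{@ISI}) = \sum_{L_2} P(c_2 \mid L_2)\, P(c_{3:2} \mid L_2)\, P(L_2 \mid L_1 = \text{@ISI})$$
= [ (0.9 * 1 * 0.7) + (0.2 * 1 * 0.3) ] = 0.69
(c3:2 is an empty evidence sequence, so P(c3:2 | L2) = 1)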
Example of Smoothing
$$P(c_2 \mid L_1 = \text{@ISI}) = \sum_{L_2} P(c_2 \mid L_2)\, P(L_2 \mid L_1 = \text{@ISI})$$
= [ (0.9 * 0.7) + (0.2 * 0.3) ] = 0.69
• P(L1 = @ISI | c1, c2) = Norm * 0.818 * 0.69 = Norm * 0.56442
• P(L1 = N@ISI | c1, c2) = Norm * 0.182 * 0.41 = Norm * 0.0746
• After normalization: P(L1 = @ISI | c1, c2) = 0.883
Smoothed estimate 0.883 > filtered estimate P(L1 = @ISI | c1) = 0.818! WHY?
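The smoothing example can be reproduced with a short script. This is a sketch under stated assumptions, not the lecture's own code: `backward`, `TRANSITION`, and `SENSOR_TRUE` are hypothetical names, the evidence is assumed binary, and the forward message P(L1 | c1) = <0.818, 0.182> is taken from the slides rather than recomputed.

```python
STATES = ("@ISI", "N@ISI")

# TRANSITION[cur][nxt] = P(L_{k+1} = nxt | L_k = cur), values from the slides
TRANSITION = {
    "@ISI": {"@ISI": 0.7, "N@ISI": 0.3},
    "N@ISI": {"@ISI": 0.3, "N@ISI": 0.7},
}
# SENSOR_TRUE[state] = P(c = true | L = state), values from the slides
SENSOR_TRUE = {"@ISI": 0.9, "N@ISI": 0.2}


def backward(b_next, evidence):
    """One backward step: b_{k+1:t} from b_{k+2:t} and the evidence e_{k+1}."""
    b = {}
    for cur in STATES:
        total = 0.0
        for nxt in STATES:
            # Assumption: binary evidence, P(false | s) = 1 - P(true | s)
            likelihood = SENSOR_TRUE[nxt] if evidence else 1.0 - SENSOR_TRUE[nxt]
            total += likelihood * b_next[nxt] * TRANSITION[cur][nxt]
        b[cur] = total
    return b


f_1_1 = {"@ISI": 0.818, "N@ISI": 0.182}   # forward message P(L1 | c1)
b_3_2 = {"@ISI": 1.0, "N@ISI": 1.0}       # empty evidence sequence c3:2
b_2_2 = backward(b_3_2, True)             # P(c2 | L1) for each value of L1

unnormalized = {s: f_1_1[s] * b_2_2[s] for s in STATES}
norm = sum(unnormalized.values())
smoothed = {s: p / norm for s, p in unnormalized.items()}

print({s: round(p, 2) for s, p in b_2_2.items()})     # backward message
print({s: round(p, 3) for s, p in smoothed.items()})  # P(L1 | c1, c2)
```

The backward message weights @ISI (0.69) more heavily than N@ISI (0.41): seeing c2 = true is more likely if the user was already at ISI at time 1, which is one way to read why the smoothed estimate exceeds the filtered one.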
HMM
Hidden Markov Models
• Speech recognition is perhaps the most popular application
– Any speech recognition researcher in class?
– Waibel and Lee
– Dominance of HMMs in speech recognition from the 1980s
– For ideal isolated conditions they say 99% accuracy
– Accuracy drops with noise, multiple speakers
• HMMs find applications everywhere: just try putting "HMM" into Google
• First we gave the Bellman update to AI (and other sciences)
• Now we make our second huge contribution to AI: the Viterbi algorithm!
HMM
• The simple nature of HMMs allows simple and elegant algorithms
• Transition model P(Xt+1 | Xt) for all values of Xt
– Represented as a matrix of size |S| * |S|
– For our example: Matrix “T”
– Tij = P(Xt = j | Xt-1 = i)
• Sensor model also represented as a diagonal matrix
– Diagonal entries give P(et | Xt = i)
– et is the evidence, e.g., ct = true
– Matrix Ot
$$T = \begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix}, \qquad O_t = \begin{pmatrix} 0.9 & 0 \\ 0 & 0.2 \end{pmatrix}$$
HMM
• f1:t+1 = Norm-const * FORWARD (f1:t, ct+1)
$$f_{1:t+1} = \text{Norm-const} \cdot P(c_{t+1} \mid L_{t+1}) \sum_{L_t} P(L_{t+1} \mid L_t)\, P(L_t \mid c_1, c_2, \ldots, c_t) = \text{Norm-const} \cdot O_{t+1}\, T^{\top} f_{1:t}$$
(T^T denotes the transpose of T)
• f1:2 = P(L2 | c1, c2) = Norm-const * O2 * T^T * f1:1
$$f_{1:2} = \text{Norm-const} \cdot \begin{pmatrix} 0.9 & 0 \\ 0 & 0.2 \end{pmatrix} \begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix} \begin{pmatrix} 0.818 \\ 0.182 \end{pmatrix} = \text{Norm-const} \cdot \begin{pmatrix} 0.63 & 0.27 \\ 0.06 & 0.14 \end{pmatrix} \begin{pmatrix} 0.818 \\ 0.182 \end{pmatrix}$$
= Norm * <(0.63 * 0.818 + 0.27 * 0.182), (0.06 * 0.818 + 0.14 * 0.182)>
= Norm * <0.5645, 0.0746>
= <0.883, 0.117> after normalization
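Backward in HMM
$$P(e_{k+1:t} \mid X_k) = b_{k+1:t} = \sum_{x_{k+1}} P(e_{k+1} \mid x_{k+1})\, P(e_{k+2:t} \mid x_{k+1})\, P(x_{k+1} \mid X_k) = T\, O_{k+1}\, b_{k+2:t}$$
P(c2 | L1 = @ISI) is the first component of b2:2:
$$b_{2:2} = \begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix} \begin{pmatrix} 0.9 & 0 \\ 0 & 0.2 \end{pmatrix} b_{3:2}$$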
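Backward
• bk+1:t = T * Ok+1 * bk+2:t
• b2:2 = T * O2 * b3:2, where b3:2 = (1, 1) because the evidence sequence c3:2 is empty
$$b_{2:2} = \begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix} \begin{pmatrix} 0.9 & 0 \\ 0 & 0.2 \end{pmatrix} \begin{pmatrix} 1 \\ 1 \end{pmatrix} = \begin{pmatrix} 0.69 \\ 0.41 \end{pmatrix}$$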
Key Results for HMMs
• f1:t+1 = Norm-const * Ot+1 * T^T * f1:t (T^T is the transpose of T)
• bk+1:t = T * Ok+1 * bk+2:t
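Both matrix results can be checked with plain Python lists. The helper names below (`mat_mul`, `mat_vec`, `transpose`, `forward_matrix`, `backward_matrix`) are hypothetical and written out by hand so the sketch has no dependencies; T and O2 are the matrices of the user-location example.

```python
def mat_mul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def mat_vec(m, v):
    """Product of a matrix (list of rows) and a column vector (list)."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]


def transpose(m):
    return [list(column) for column in zip(*m)]


T = [[0.7, 0.3],
     [0.3, 0.7]]    # T[i][j] = P(X_t = j | X_{t-1} = i)
O2 = [[0.9, 0.0],
      [0.0, 0.2]]   # diagonal sensor matrix for c2 = true


def forward_matrix(f, O):
    """f_{1:t+1} = Norm-const * O_{t+1} * T^T * f_{1:t}."""
    unnormalized = mat_vec(mat_mul(O, transpose(T)), f)
    total = sum(unnormalized)
    return [p / total for p in unnormalized]


def backward_matrix(b, O):
    """b_{k+1:t} = T * O_{k+1} * b_{k+2:t} (no normalization)."""
    return mat_vec(mat_mul(T, O), b)


f_1_2 = forward_matrix([0.818, 0.182], O2)
b_2_2 = backward_matrix([1.0, 1.0], O2)
print([round(p, 3) for p in f_1_2])  # -> [0.883, 0.117]
print([round(p, 2) for p in b_2_2])  # -> [0.69, 0.41]
```

Because T is symmetric in this example, the transpose makes no numerical difference here; it matters as soon as the two off-diagonal transition probabilities differ.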
Inference in DBN
• How to do inference in a DBN in general? Could unroll the loop forever…
[Figure: DBN unrolled over time slices Xt, Xt+1, Xt+2, Xt+3 with evidence nodes Et, Et+1, Et+2, Et+3]
• Slices added beyond the last observation have no effect on inference. WHY?
• So only keep slices within the observation period
Inference in DBN
[Figure: DBN slices Xt, Et, Xt+1, Et+1 shown alongside a network with nodes Alarm, John, and Mary]
• Slices added beyond the last observation have no effect on inference. WHY?
• P(Alarm | JohnCalls) is independent of MaryCalls
Complexity of inference in DBN
• Keep at most two slices in memory
– Start with slice 0
– Add slice 1
– “Sum out” slice 0 (get a probability distribution over slice 1 state; don’t need to go back to slice 0 anymore, like POMDPs)
– Add slice 2, sum out slice 1…
• Constant time and space per update
• Unfortunately, the update is exponential in the number of state variables
• Need approximate inference algorithms
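A rough way to see the exponential cost, assuming every state variable is Boolean and the slice is treated as a single HMM over the joint state (the function name `dbn_slice_sizes` is hypothetical): the belief vector has one entry per joint state, and a full transition matrix has one entry per pair of joint states.

```python
def dbn_slice_sizes(num_boolean_vars):
    """Return (joint states, transition-matrix entries) for one time slice.

    Assumes Boolean state variables and an explicit |S| x |S| transition
    matrix over the joint state, as in the HMM formulation.
    """
    joint_states = 2 ** num_boolean_vars
    return joint_states, joint_states ** 2


for n in (1, 5, 10, 20):
    states, entries = dbn_slice_sizes(n)
    print(f"{n:2d} variables: {states:>9,d} joint states, "
          f"{entries:>17,d} transition entries")
```

The per-update cost is constant in the length of the evidence sequence, but that constant already reaches about a million joint states at 20 Boolean variables.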
Solving DBNs in General
• Exact methods:
– Compute intensive
– Variable elimination from Chapter 14
• Approximate methods:
– Particle filtering is popular
– Run N samples together through slices of the DBN network
– All N samples constitute the forward message
– Highly efficient
– Hard to provide theoretical guarantees
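A minimal particle-filtering sketch for the two-state user-location example is given below. It is one common variant (propagate, weight by the evidence, resample), not necessarily the version presented in the course. The names are hypothetical, the prior over the initial location is assumed uniform, the evidence is assumed binary, and the random seed is fixed only to make the run repeatable.

```python
import random

STATES = ("@ISI", "N@ISI")
STAY = 0.7                                  # P(same location at the next step)
SENSOR_TRUE = {"@ISI": 0.9, "N@ISI": 0.2}   # P(c = true | location)


def other(state):
    return "N@ISI" if state == "@ISI" else "@ISI"


def particle_filter_step(particles, evidence, rng):
    """Advance all N samples one slice and condition them on the evidence."""
    # 1. Propagate each sample through the transition model.
    moved = [s if rng.random() < STAY else other(s) for s in particles]
    # 2. Weight each sample by the likelihood of the observed evidence.
    weights = [SENSOR_TRUE[s] if evidence else 1.0 - SENSOR_TRUE[s]
               for s in moved]
    # 3. Resample N samples in proportion to the weights.
    return rng.choices(moved, weights=weights, k=len(moved))


rng = random.Random(0)
N = 20000
# Assumption: uniform prior over the initial location.
particles = [rng.choice(STATES) for _ in range(N)]
for c in (True, True):                      # evidence c1 = true, c2 = true
    particles = particle_filter_step(particles, c, rng)

estimate = particles.count("@ISI") / N      # approximates P(L2 = @ISI | c1, c2)
print(f"estimated P(L2 = @ISI | c1, c2) = {estimate:.3f}")
```

The population of samples plays the role of the forward message: the fraction of samples in each state approximates the filtered distribution, which exact filtering gives as about 0.883 for @ISI in this example.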
Next Lecture
Continue with Chapter 15
Student Evaluations
Surprise Quiz II: Part II
[Figure: time slices with nodes Xt, Et, Xt+1, Et+1, and a second evidence node E’t+1]

Xt     P(Xt+1)
T      0.5
F      0.5

Xt+1   P(E)
T      0.8
F      0.01

Xt+1   P(E’)
T      0.7
F      0.01

Question:
Most Likely Path
• Given a sequence of observations, find the sequence of states that most likely generated these observations
• E.g., in the E-elves example, suppose the observations are [activity, activity, no-activity, activity, activity]
• What is the most likely explanation of the presence of the user at ISI over the course of the day?
– Did the user step out at time = 3?
– Was the user present all the time, but in a meeting at time 3?
• $\arg\max_{x_{1:t}} P(x_{1:t} \mid e_{1:t})$
Not so simple…
• Use smoothing to find the posterior distribution at each time step
– E.g., compute P(L1=@ISI | c1:5) and P(L1=N@ISI | c1:5), find the max
– Do the same for P(L2=@ISI | c1:5) vs P(L2=N@ISI | c1:5), find the max
• Find the maximum this way
• Why might this be different from computing what we want (the most likely sequence)?
• Compute $\max_{x_{1:t+1}} P(x_{1:t+1} \mid e_{1:t+1})$ via the Viterbi algorithm, using the recursion
$$\max_{x_{1:t}} P(x_{1:t}, X_{t+1} \mid e_{1:t+1}) = \mathrm{Norm} \cdot P(e_{t+1} \mid X_{t+1}) \max_{x_t} \Big( P(X_{t+1} \mid x_t) \max_{x_{1:t-1}} P(x_1, \ldots, x_{t-1}, x_t \mid e_{1:t}) \Big)$$
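The recursion can be sketched as follows for the observation sequence [activity, activity, no-activity, activity, activity]. Several things here are assumptions: that the 0.9/0.2 sensor model and 0.7/0.3 transition model of the user-location example apply to "activity" observations, that the prior over the first location is uniform, and all names (`viterbi`, `likelihood`, `SENSOR_ACTIVITY`, ...) are hypothetical. Normalization is skipped because it does not change which path is the maximum.

```python
STATES = ("@ISI", "N@ISI")
PRIOR = {"@ISI": 0.5, "N@ISI": 0.5}           # assumed uniform prior
TRANSITION = {
    "@ISI": {"@ISI": 0.7, "N@ISI": 0.3},
    "N@ISI": {"@ISI": 0.3, "N@ISI": 0.7},
}
SENSOR_ACTIVITY = {"@ISI": 0.9, "N@ISI": 0.2}  # assumed P(activity | location)


def likelihood(state, activity):
    p = SENSOR_ACTIVITY[state]
    return p if activity else 1.0 - p


def viterbi(observations):
    """Return the most likely state sequence for the given observations."""
    # m[s] = (unnormalized) probability of the best path that ends in state s
    m = {s: PRIOR[s] * likelihood(s, observations[0]) for s in STATES}
    back_pointers = []
    for obs in observations[1:]:
        new_m, pointers = {}, {}
        for nxt in STATES:
            # max over x_t of P(X_{t+1} | x_t) * (best path ending in x_t)
            best_prev = max(STATES,
                            key=lambda prev: m[prev] * TRANSITION[prev][nxt])
            pointers[nxt] = best_prev
            new_m[nxt] = (likelihood(nxt, obs)
                          * TRANSITION[best_prev][nxt] * m[best_prev])
        back_pointers.append(pointers)
        m = new_m
    # Follow the back pointers from the best final state.
    state = max(STATES, key=lambda s: m[s])
    path = [state]
    for pointers in reversed(back_pointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


observations = [True, True, False, True, True]  # True = activity
path = viterbi(observations)
print(path)
# -> ['@ISI', '@ISI', 'N@ISI', '@ISI', '@ISI']
```

Under these assumptions the single no-activity observation is best explained by the user stepping out at time 3. Picking the per-step maximum of the smoothed posteriors need not give the same answer in general, because those maxima are chosen independently and may not form a likely joint sequence.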