Optimal Sensor Scheduling via Classification Reduction of Policy Search (CROPS)
Optimal Sensor Scheduling via Classification Reduction of Policy Search (CROPS)
ICAPS Workshop 2006
Doron Blatt and Alfred Hero
University of Michigan
Motivating Example: Landmine Detection
A vehicle carries three sensors for land-mine detection, each with its own characteristics.
The goal is to optimally schedule the three sensors for mine detection.
This is a sequential choice of experiment problem (DeGroot 1970).
We do not know the model but can generate data through experiments and simulations.
[Figure: sensor-scheduling decision tree. Objects include a plastic anti-personnel mine, a plastic anti-tank mine, a nail, and a rock. At each new location the EMI, GPR, and Seismic sensors may be deployed in sequence, each returning data, until a final detection decision is made.]
Reinforcement Learning
General objective: to find optimal policies for controlling stochastic decision processes:
• without an explicit model.
• when the exact solution is intractable.
Applications:
• Sensor scheduling.
• Treatment design.
• Elevator dispatching.
• Robotics.
• Electric power system control.
• Job-shop scheduling.
The Optimal Policy
The optimal policy satisfies [equation omitted in transcript].
It can be found via dynamic programming: [equation omitted in transcript],
where the policy q_t corresponds to random action selection.
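The equations on this slide were images and did not survive the transcript. A plausible reconstruction of the standard finite-horizon recursion they presumably showed (the symbols o_t, Q_t, r_t are my notation, not necessarily the slide's):

```latex
% Optimal policy as the greedy policy w.r.t. stage-wise action values:
\pi_t^*(o_t) \in \arg\max_{a}\, Q_t(o_t, a), \qquad t = 0, \dots, T,
% computed by the backward dynamic-programming recursion:
Q_T(o_T, a) = \mathbb{E}\!\left[\, r_T \mid o_T, a \,\right],
\qquad
Q_t(o_t, a) = \mathbb{E}\!\left[\, r_t + \max_{a'} Q_{t+1}(o_{t+1}, a') \,\middle|\, o_t, a \,\right].
```

The exploration policy q_t enters when these expectations must be estimated from data: actions before the stage being estimated are drawn at random rather than from the (unknown) optimal policy.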
The Generative Model Assumption
Generative model assumption (Kearns et al., 2000):
• The explicit model is unknown.
• It is possible to generate trajectories by simulation or experiment.
[Figure: a trajectory tree. From the root observation O0, each binary action a_t in {0, 1} branches to a new observation, so that all action sequences over three stages (nodes O1, O2, O3) are covered.]
M. Kearns, Y. Mansour, and A. Ng, “Approximate planning in large POMDPs via reusable trajectories,” in Advances in Neural Information Processing Systems, vol. 12. MIT Press, 2000.
Learning from Generative Models
It is possible to evaluate the value of any policy π from trajectory trees:
Let R_i(π) be the sum of rewards on the path that agrees with policy π on the i-th tree. Then, V̂(π) = (1/n) Σ_{i=1}^n R_i(π).
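The trajectory-tree estimator above can be sketched in a few lines of Python. This is a minimal illustration, not the workshop's code: the horizon, the Gaussian observations, and the uniform edge rewards are synthetic placeholders standing in for runs of the real generative model.

```python
import random

T = 3  # planning horizon: stages 0..T-1, binary actions {0, 1}

def histories(depth):
    """All action histories of a given length."""
    if depth == 0:
        return [()]
    return [h + (a,) for h in histories(depth - 1) for a in (0, 1)]

def make_tree(rng):
    """One synthetic trajectory tree: an observation at every action
    history and a reward on every action edge (stand-ins for one run
    of the generative model down all 2^T paths)."""
    obs = {h: rng.gauss(0.0, 1.0) for d in range(T) for h in histories(d)}
    rew = {h + (a,): rng.random()
           for d in range(T) for h in histories(d) for a in (0, 1)}
    return obs, rew

def path_reward(tree, policy):
    """R_i(pi): sum of rewards on the unique root-to-leaf path of a
    tree that agrees with the deterministic policy."""
    obs, rew = tree
    hist, total = (), 0.0
    for t in range(T):
        a = policy(t, obs[hist])
        total += rew[hist + (a,)]
        hist += (a,)
    return total

def value_estimate(trees, policy):
    """V_hat(pi) = (1/n) * sum_i R_i(pi)."""
    return sum(path_reward(tr, policy) for tr in trees) / len(trees)

rng = random.Random(0)
trees = [make_tree(rng) for _ in range(200)]
greedy = lambda t, o: 1 if o > 0 else 0  # an arbitrary threshold policy
v = value_estimate(trees, greedy)
```

The key property, which the generalization bounds below exploit, is that the same n trees can be reused to evaluate every policy in the class.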
Three Sources of Error in RL
• Misallocation of approximation resources to the state space: without knowing the optimal policy, one cannot sample from the distribution that it induces on the stochastic system's state space.
• Coupling of optimal decisions at each stage: finding the optimal decision rule at a certain stage hinges on knowing the optimal decision rule for future stages.
• Inadequate control of generalization errors: without a model, ensemble averages must be approximated from training trajectories.
J. Bagnell, S. Kakade, A. Ng, and J. Schneider, “Policy search by dynamic programming,” in Advances in Neural Information Processing Systems, vol. 16. 2003.
A. Fern, S. Yoon, and R. Givan, “Approximate policy iteration with a policy language bias,” in Advances in Neural Information Processing Systems, vol. 16, 2003.
M. Lagoudakis and R. Parr, “Reinforcement learning as classification: Leveraging modern classifiers,” in Proceedings of the Twentieth International Conference on Machine Learning, 2003.
J. Langford and B. Zadrozny, "Reducing T-step reinforcement learning to classification," http://hunch.net/~jl/projects/reductions/reductions.html, 2003.
M. Kearns, Y. Mansour, and A. Ng, “Approximate planning in large POMDPs via reusable trajectories,” in Advances in Neural Information Processing Systems, vol. 12. MIT Press, 2000.
S. A. Murphy, “A generalization error for Q-learning,” Journal of Machine Learning Research, vol. 6, pp. 1073–1097, 2005.
Learning from Generative Models
Drawbacks: the combinatorial optimization problem [omitted in transcript] can only be solved for small n and small policy classes.
Our remedies:
• Break the multi-stage search problem into a sequence of single-stage optimization problems.
• Use a convex surrogate to simplify each optimization problem.
We will obtain generalization bounds similar to (Kearns et al., 2000) but that apply to the case in which the decision rules are estimated sequentially by reduction to classification.
Fitting the Hindsight Path
Zadrozny & Langford 2003: on each tree, find the reward-maximizing path.
Fit T+1 classifiers to these paths.
Driving the classification error to zero is equivalent to finding the optimal policy.
Drawback: in stochastic problems, no classifier can predict the hindsight action choices.
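The hindsight-path construction is easy to sketch. This is my own toy rendering (synthetic trees, not the authors' code): exhaustively find the reward-maximizing path on each tree, then collect per-stage (observation, hindsight action) pairs to which T classifiers would be fit.

```python
import random

T, rng = 3, random.Random(1)  # horizon and seed (illustrative choices)

def histories(depth):
    if depth == 0:
        return [()]
    return [h + (a,) for h in histories(depth - 1) for a in (0, 1)]

def make_tree():
    obs = {h: rng.gauss(0.0, 1.0) for d in range(T) for h in histories(d)}
    rew = {h + (a,): rng.random()
           for d in range(T) for h in histories(d) for a in (0, 1)}
    return obs, rew

def hindsight_actions(tree):
    """Action sequence of the reward-maximizing root-to-leaf path."""
    _, rew = tree
    def best(hist, depth):
        if depth == T:
            return 0.0, []
        branches = []
        for a in (0, 1):
            v, acts = best(hist + (a,), depth + 1)
            branches.append((rew[hist + (a,)] + v, [a] + acts))
        return max(branches, key=lambda b: b[0])
    return best((), 0)[1]

# Per-stage training sets of (observation, hindsight-action) pairs;
# one classifier per stage would then be fit to these labels.
trees = [make_tree() for _ in range(300)]
data = [[] for _ in range(T)]
for tree in trees:
    obs, _ = tree
    hist = ()
    for t, a in enumerate(hindsight_actions(tree)):
        data[t].append((obs[hist], a))
        hist += (a,)
```

Note that in this synthetic problem the rewards are independent of the observations, so the hindsight labels are essentially noise — an extreme case of the drawback the slide points out: no classifier can predict the hindsight choices in stochastic problems.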
Our Approximate Dynamic Programming Approach
Assume the policy class has the form [omitted in transcript].
Estimating π_T via tree pruning: [equation omitted in transcript].
This is the empirical equivalent of: [equation omitted in transcript].
Call the resulting policy π̂_T.
[Figure: pruned trajectory tree for the last stage — random actions are chosen up to stage T, and a single-stage RL problem is solved at stage T.]
Our Approximate Dynamic Programming Approach
Estimating π_{T-1} given π̂_T via tree pruning: [equation omitted in transcript].
This is the empirical equivalent of: [equation omitted in transcript].
[Figure: pruned trajectory tree — random actions up to stage T-1, a single-stage RL problem solved at stage T-1, and rewards propagated from stage T according to π̂_T.]
Our Approximate Dynamic Programming Approach
Estimating π_{T-2} = π_0 given π̂_{T-1} and π̂_T via tree pruning: [equation omitted in transcript].
This is the empirical equivalent of: [equation omitted in transcript].
[Figure: pruned trajectory tree — a single-stage RL problem is solved at stage 0, with rewards propagated through the later stages according to the already-estimated policies.]
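The backward sweep just described can be sketched end to end. This is a simplified illustration under my own assumptions: synthetic trees, a toy class of threshold policies, and one random prefix per tree to reach the stage being estimated (the actual algorithm can reuse every random-action path in each tree).

```python
import random

T, rng = 3, random.Random(2)  # horizon and seed (illustrative)

def histories(depth):
    if depth == 0:
        return [()]
    return [h + (a,) for h in histories(depth - 1) for a in (0, 1)]

def make_tree():
    obs = {h: rng.gauss(0.0, 1.0) for d in range(T) for h in histories(d)}
    rew = {h + (a,): rng.random()
           for d in range(T) for h in histories(d) for a in (0, 1)}
    return obs, rew

def tail_reward(tree, hist, t, policies):
    """Propagate rewards from stage t onward along the already-estimated
    policies (the 'pruned' part of the tree)."""
    obs, rew = tree
    total = 0.0
    for s in range(t, T):
        a = policies[s](obs[hist])
        total += rew[hist + (a,)]
        hist += (a,)
    return total

def fit_stage(trees, t, policies):
    """Single-stage problem at stage t: random actions reach stage t,
    then the empirical mean of reward-to-go is maximized over a toy
    class of threshold policies."""
    prefixes = [tuple(rng.randint(0, 1) for _ in range(t)) for _ in trees]
    def value(theta):
        total = 0.0
        for tree, pre in zip(trees, prefixes):
            obs, rew = tree
            a = 1 if obs[pre] > theta else 0
            total += rew[pre + (a,)] + tail_reward(tree, pre + (a,), t + 1, policies)
        return total / len(trees)
    theta = max((k / 10.0 for k in range(-20, 21)), key=value)
    return lambda o, th=theta: 1 if o > th else 0

trees = [make_tree() for _ in range(200)]
policies = {}
for t in reversed(range(T)):  # Gauss-Seidel-style backward sweep
    policies[t] = fit_stage(trees, t, policies)
```

The grid search over theta stands in for the combinatorial single-stage optimization; the next slides replace it with a convex-surrogate classification problem.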
Reduction to Weighted Classification
Our approximate dynamic programming algorithm converts the multi-stage optimization problem into a sequence of single-stage optimization problems. Unfortunately, each single-stage problem is still a combinatorial optimization problem.
Our solution: reduce it to learning classifiers with a convex surrogate.
This classification reduction is different from previous work.
Consider a single-stage RL problem:
Consider a class of real-valued functions. Each function induces a policy (via the sign of its output), and we would like to maximize the value of the induced policy.
[Figure: single-stage trajectory tree — observation O0 with actions a0 = +1 and a0 = -1 leading to the two leaf observations.]
Reduction to Weighted Classification
Note that [derivation omitted in transcript]. Therefore, solving a single-stage RL problem is equivalent to a weighted classification problem, where each example is labeled by its reward-maximizing action and weighted by the magnitude of the reward difference between the two actions.
Reduction to Weighted Classification
It is often much easier to solve the surrogate problem [omitted in transcript], where φ is a convex function. For example:
• In neural network training, φ is the truncated quadratic loss.
• In boosting, φ is the exponential loss.
• In support vector machines, φ is the hinge loss.
• In logistic regression, φ is the scaled deviance.
The effect of introducing φ is well understood for the classification problem, and the results can be applied to the single-stage RL problem as well.
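A minimal sketch of the reduction on synthetic data, using the logistic loss as the convex surrogate (one of the choices listed above); the problem setup, the linear score class, and all names here are my own illustrative choices, not the paper's.

```python
import math
import random

rng = random.Random(3)

# Synthetic single-stage problem (not the landmine model): one scalar
# observation o; action +1's reward advantage grows with o.
def sample():
    o = rng.gauss(0.0, 1.0)
    rewards = {+1: 0.6 * o + rng.gauss(0.0, 0.2), -1: rng.gauss(0.0, 0.2)}
    return o, rewards

data = [sample() for _ in range(500)]

# Reduction: label = reward-maximizing action, weight = |reward gap|.
examples = [(o, max(r, key=r.get), abs(r[+1] - r[-1])) for o, r in data]

# Minimize the weighted logistic surrogate  c * log(1 + exp(-y * f(o)))
# over linear scores f(o) = w*o + b, by plain gradient descent.
w = b = 0.0
lr = 0.1
for _ in range(300):
    gw = gb = 0.0
    for o, y, c in examples:
        g = -y * c / (1.0 + math.exp(y * (w * o + b)))  # dLoss/df
        gw += g * o
        gb += g
    w -= lr * gw / len(examples)
    b -= lr * gb / len(examples)

policy = lambda o: +1 if w * o + b > 0 else -1
```

The weights make the classifier focus on observations where the choice of action matters most — exactly the states where a classification error costs the most reward.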
Reduction to Weighted Classification: Multi-Stage Problem
Let π̂ be the policy estimated by the approximate dynamic programming algorithm, where each single-stage RL problem is solved via surrogate-loss minimization.
Theorem 2: Assume P-dim(F_t) = d_t, t = 0, …, T. Then, with probability greater than 1-δ over the set of trajectory trees, [bound omitted in transcript] for n satisfying [condition omitted in transcript].
The proof uses recent results in: P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe, "Convexity, classification, and risk bounds," Journal of the American Statistical Association, vol. 101, no. 473, pp. 138–156, March 2006.
This is tighter than the analogous Q-learning bound (Murphy, JMLR 2005).
Application to Landmine Sensor Scheduling
A sand box experiment was conducted by Jay Marble to extract features of the three sensors for different types of land-mines and clutter.
Based on the results, the sensors' outputs were simulated as a Gaussian mixture.
Feed-forward neural networks were trained to perform both the classification task and the weighted classification tasks.
Performance was evaluated on a separate data set.
[Figure: detection performance vs. increasing sensor deployment cost. Curves compare the performance obtained by optimal sensor scheduling against randomized sensor allocation and three fixed policies: always deploy all three sensors; always deploy the best pair of sensors (GPR + Seismic); always deploy the best single sensor (EMI).]
Reinforcement Learning for Sensor Scheduling: Weighted Classification Reduction
Sensor features by object type (1 M-AT, 2 M-AP, 3 P-AT, 4 P-AP, 5 Cltr-1, 6 Cltr-2, 7 Cltr-3, 8 Bkg):

Sensor       Feature        1      2       3      4       5       6       7     8
EMI (1)      Conductivity   High   High    Medium High    High    Low     Low   Low
EMI (1)      Size           High   High    High   Medium  Medium  Low     Low   Low
GPR (2)      Depth          High   Medium  High   Medium  Low     Low     Low   Low
GPR (2)      RCS            High   Medium  High   Medium  High    High    High  Low
Seismic (3)  Resonance      High   Medium  High   Medium  Medium  Medium  Low   Low
Optimal Policy for Mean States
Policy for specific scenarios — optimal sensor sequence for the mean state of each object type: 23D, 21D, 23D, 213D, 23D, 23D, 23D, 23D (the digits index the sensors, 1 = EMI, 2 = GPR, 3 = Seismic; D marks the final detection decision).
Application to Waveform Selection: Landsat MSS Experiment
The data consists of 4435 training cases and 2000 test cases. Each case is a 3x3x4 image stack in 36 dimensions with 1 class attribute:
(1) Red soil, (2) Cotton, (3) Vegetation stubble, (4) Gray soil, (5) Damp gray soil, (6) Very damp gray soil.
• For each image location we adopt a two-stage policy to classify its label:
• Select one of 6 possible pairs of the 4 MSS bands for initial illumination.
• Based on the initial measurement, either:
  • make the final decision on the terrain class and stop, or
  • illuminate with the remaining two MSS bands and make the final decision.
• The reward is the average probability of correct decision minus the stopping time (energy).
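The per-location reward in the bullets above can be written out directly. A minimal sketch — the function and argument names are my own, not from the experiment code:

```python
def episode_reward(true_class, first_guess, stop, second_guess, c):
    """Reward for one location under the two-stage policy:
    stop after the initial band pair -> I(correct);
    illuminate the remaining bands   -> I(correct) - c,
    where c is the energy cost of the extra measurement."""
    if stop:
        return 1.0 if first_guess == true_class else 0.0
    return (1.0 if second_guess == true_class else 0.0) - c

# Stopping early when already right beats paying c for the same answer:
early = episode_reward(2, 2, True, None, 0.1)   # correct, no extra cost
late = episode_reward(2, 1, False, 2, 0.1)      # correct after paying c
```

Sweeping c trades off accuracy against energy, which is what the Pe-vs-dwells curves below explore.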
Waveform Scheduling: CROPS
[Figure: waveform-scheduling decision tree. At a new location, select one of the six band pairs — (1,2), (1,3), (1,4), (2,3), (2,4), (3,4) — then either classify immediately (reward = I(correct)) or illuminate with the remaining two bands and classify (reward = I(correct) - c).]
[Figure: probability of error Pe (roughly 0.095–0.135) vs. expected number of dwells (1–2), for neural network and k-nearest neighbor classifiers, as the band cost C sweeps from 0 to 0.18.]
Reinforcement Learning for Sensor Scheduling: Weighted Classification Reduction
LANDSAT data: a total of 4 bands, each producing a 9-dimensional vector.
Best myopic initial pair: (1,2). Non-myopic initial pair: (2,3).
The performance with all four bands is shown for reference.
* C is the cost of using the additional two bands.
Sub-band Optimal Scheduling
Optimal initial sub-bands are 1+2.

Clutter type                       1     2     3     4     5     6     Pc
Performance of sub-bands 1+2       0.98  0.85  0.96  0     0.6   0.94  0.806
Performance of all sub-bands       0.97  0.95  0.92  0.54  0.84  0.82  0.862
Performance of optimal scheduler   0.98  0.94  0.93  0.51  0.84  0.82  0.861

Policy statistics: the full spectrum is used only 60% of the time.
Conclusions
Elements of CROPS:
• A Gauss-Seidel-type DP approximation reduces the multi-stage problem to a sequence of single-stage RL problems.
• Classification reduction is used to solve each of these single-stage RL problems.
• Obtained tight finite-sample generalization error bounds for RL based on classification theory.
• The CROPS methodology is illustrated for energy-constrained landmine detection and waveform selection.
Publications
Blatt D., "Adaptive Sensing in Uncertain Environments," PhD Thesis, Dept. EECS, University of Michigan, 2006.
Blatt D. and Hero A. O., "From weighted classification to policy search," Nineteenth Conference on Neural Information Processing Systems (NIPS), 2005.
Kreucher C., Blatt D., Hero A. O., and Kastella K., "Adaptive multi-modality sensor scheduling for detection and tracking of smart targets," Digital Signal Processing, 2005.
Blatt D., Murphy S. A., and Zhu J., "A-learning for Approximate Planning," Technical Report 04-63, The Methodology Center, Pennsylvania State University, 2004.
Simulation Details
Dimension reduction: PCA subspace explaining 99.9% of the variance (13–18 dimensions).

Sub-bands   Dim
1+2         13
1+3         17
1+4         17
2+3         15
2+4         15
3+4         15
1+2+3+4     18

State at time t: projection of the collected data onto the PCA subspace.
Policy search:
• Weighted classification building block: weight-sensitive combination of [5,2] and [6,2] [tansig, logsig] neural networks.
Label classifier:
• Unweighted classification building block: combination of [5,6] and [6,6] [tansig, logsig] feed-forward neural networks.
Training used 1500 trajectories for the label classifiers and 2935 trajectories for policy search.
• Adaptive-length gradient learning with a momentum term.
• Reseeding applied to avoid local minima.
Performance evaluation used 2000 trajectories.
Sub-band performance matrix

Sub-bands  1     2     3     4     5     6     Pc
1+2        0.98  0.85  0.96  0     0.6   0.94  0.806
1+3        0.90  0.84  0.91  0.55  0.56  0.8   0.796
1+4        0.96  0.93  0.92  0.48  0.56  0.76  0.803
2+3        0.91  0.94  0.84  0.56  0.65  0.82  0.812
2+4        0.90  0.92  0.9   0.18  0.76  0.87  0.805
3+4        0.86  0.92  0.76  0.5   0.42  0.79  0.739
All        0.97  0.95  0.92  0.54  0.84  0.82  0.862
Notes: 1+2 is the best myopic choice; 2+3 is the best non-myopic choice when the policy is likely to take more than one observation.