1 Jerry Tsai [email protected] This presentation available at: clintuition.com/pubs
Optimal Model Search By a Genetic Algorithm Using SAS®
Jerry Tsai
Problem Statement
- n observations; p possible predictors; n >> p >> 0
- 2^p possible subsets of the set of predictors
- The challenge: choose a subset of the possible predictors that has the greatest predictive ability relative to its size
Problem Definition
- What do statisticians call this problem? "Subset selection"; finding the "best (predictive) model"; finding a "parsimonious model"
- How do statisticians approach this problem? Conduct a search through a space defined by the 2^p possible combinations of the p parameters to find a subset of those parameters that optimizes an objective function
Reasons to Search for an Optimal Model
1. To describe the relative importance of variables
2. To save money in data collection and management
3. To enhance predictive ability
But we should make very sure it is worth the effort:
- Inappropriate for estimation and hypothesis testing
- Time-consuming
Commonly Known Search Heuristics
- Forward, backward, and stepwise selection: found in REG, LOGISTIC, PHREG, and more
- LAR (least angle regression) and LASSO (least absolute shrinkage and selection operator): both found in GLMSELECT
- All of these heuristics use an incremental approach when searching for an optimal model
Incremental Approach
- To a set, add or subtract one variable at a time
- Include or exclude a candidate variable if:
  - The variable meets entry and stopping criteria, OR
  - The set of variables with the candidate variable added better optimizes the objective function
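As an illustration of this incremental step, here is a minimal greedy forward selection in Python (not the SAS procedures named earlier); `toy_score` and the variable names are made up for the example:

```python
# Minimal sketch of incremental (forward) selection -- illustrative
# Python, not SAS. `toy_score` stands in for a real fit statistic.
def forward_select(candidates, score):
    """Greedily add the single variable that most improves score(subset)."""
    selected, best = [], score([])
    while candidates:
        # The incremental step: try each remaining candidate one at a time
        trial_best, v = max((score(selected + [c]), c) for c in candidates)
        if trial_best <= best:          # stopping criterion: no improvement
            break
        selected.append(v)
        candidates = [c for c in candidates if c != v]
        best = trial_best
    return selected

# Toy objective: reward variables from a "true" set, penalize subset size
TRUE = {"bravo", "charlie", "kilo"}
toy_score = lambda s: len(TRUE & set(s)) - 0.1 * len(s)
print(forward_select(["alfa", "bravo", "charlie", "delta", "kilo"], toy_score))
```

Note how the search only ever considers one-variable changes to the current set, which is exactly the limitation the holistic approach below avoids.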
Holistic Approach
- Assess a set of variables as a whole; sets of variables are compared to one another
- Each element (variable) of the set is treated equally
- Disadvantage: less "helpful" elements of the set are treated the same as more "helpful" elements of the set
- Advantage: may uncover synergism or confounding among variables
Advantage of a Non-incremental Approach
- The absolute optimum may be undiscoverable through an incremental approach, due to: confounding; endogeneity; nonlinearity (with respect to a link function)
- The space searched can be much greater: forward selection examines only O(p^2) subsets, versus the 2^p in the full space, and

  lim_{p -> infinity} O(p^2) / 2^p = 0

  with this ratio converging to zero quickly.
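As a quick numeric check of how fast that ratio vanishes (a Python sketch, not part of the original deck):

```python
# Numeric illustration of p^2 / 2^p shrinking: an incremental search
# sees a rapidly vanishing sliver of the full 2^p model space.
for p in (5, 10, 20, 30):
    print(p, p**2 / 2**p)
```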
Advantage of Using Regression
- Statisticians are very familiar with generalized linear models (GLMs)
- Parameter estimates are amenable to comprehensible interpretation
Genetic Algorithm Implementation
- Create a generation of sets of variables (a set of sets)
- Score all sets in a generation
- Sets that score higher are selected for reproduction
- These selected sets are recombined and mutated to yield additional sets
- These additional sets constitute a new generation, which in turn undergoes scoring, selection, and recombination
Why Use a Genetic Algorithm?
- Examples from nature suggest local optima are eventually found
- A holistic approach allows variables to be assessed simultaneously
- The search covers a much larger area than traditional incremental approaches
Implementation
- The presence (or absence) of each variable in a set is represented by a bit
- A string of such bits constitutes a chromosome
- So each chromosome represents a subset of the possible predictors
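This encoding can be sketched in Python. The variable names are the slides' illustrative NATO-alphabet list (the middle names, golf through juliett, are assumed, since the slides elide them with "delta...kilo"):

```python
# Sketch of the bit-string representation (illustrative names, not real data).
VARS = ["alfa", "bravo", "charlie", "delta", "echo", "foxtrot",
        "golf", "hotel", "india", "juliett", "kilo", "lima"]

def encode(subset):
    """One bit per candidate variable: 1 = in the model, 0 = out."""
    return "".join("1" if v in subset else "0" for v in VARS)

def decode(chromosome):
    return [v for v, bit in zip(VARS, chromosome) if bit == "1"]

print(encode({"bravo", "charlie", "kilo"}))  # -> 011000000010
print(decode("011000000010"))                # -> ['bravo', 'charlie', 'kilo']
```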
Implementation Illustration
- 12 possible parameters: alfa, bravo, charlie, delta...kilo, lima
- Representation example: the variables bravo, charlie, and kilo constitute a subset (i.e., constitute a model):

  abcd efgh ijkl
  0110 0000 0010
Genetic Operation – Mutation
- Logically negate bits within a chromosome (point mutation): 0 becomes 1; 1 becomes 0
Mutation Example
- Logically negate random bits within a chromosome (point mutation): 0 becomes 1; 1 becomes 0
- Example: starting from {bravo; charlie; kilo}, with bravo, echo, and lima randomly selected for mutation:

  abcd efgh ijkl
  0110 0000 0010  (before: bravo, charlie, kilo)
  0010 1000 0011  (after: charlie, echo, kilo, lima)
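A point-mutation sketch in Python (not the author's SAS macro). The `FlipAt` helper is a stand-in RNG used only to reproduce the slides' bravo/echo/lima flips deterministically; in practice you would pass the `random` module:

```python
import random

def mutate(chromosome, rate=0.25, rng=random):
    """Point mutation: flip each bit independently with probability `rate`."""
    return "".join(
        ("1" if bit == "0" else "0") if rng.random() < rate else bit
        for bit in chromosome
    )

class FlipAt:                          # forces flips at fixed positions
    def __init__(self, positions):
        self.i, self.positions = -1, set(positions)
    def random(self):
        self.i += 1
        return 0.0 if self.i in self.positions else 1.0

# Positions 1, 4, 11 are bravo, echo, lima in the 12-letter encoding
print(mutate("011000000010", rng=FlipAt({1, 4, 11})))  # -> 001010000011
```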
Genetic Operation – Crossover
Two chromosomes exchange genetic information (Morgan 1916)
Crossover Example
- Example: CROSSOVER[{bravo; charlie; kilo}; {bravo; echo; lima}; @ foxtrot]:

  0110 0000 0010   0100 1000 0001   (parents: {bravo; charlie; kilo}, {bravo; echo; lima})
  0110 0000 0001   0100 1000 0010   (children: {bravo; charlie; lima}, {bravo; echo; kilo})
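A single-point crossover sketch in Python reproducing the slides' example. "@ foxtrot" is taken to mean the tails from the foxtrot bit onward (position 5, zero-indexed) are exchanged:

```python
def crossover(parent1, parent2, point):
    """Single-point crossover: swap the tails starting at bit `point`."""
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2

p1 = "011000000010"   # {bravo, charlie, kilo}
p2 = "010010000001"   # {bravo, echo, lima}
print(crossover(p1, p2, 5))  # -> ('011000000001', '010010000010')
```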
Genetic Algorithm – Main Steps
- Initialize: set up the environment; create the starting generation
- Evaluate (i.e., score): chromosomes (i.e., individuals); the generation
- Report (interim)
- Select (i.e., choose which individuals reproduce)
- Reproduce (i.e., create a new generation): apply the genetic operators
Flow Chart

  Initialize -> Evaluate -> Report (interim) -> Escape?
  If no:  Select -> Reproduce -> back to Evaluate
  If yes: Report (final)
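The loop above can be sketched as a generic GA in Python (an illustration, not the author's SAS macro). Truncation selection stands in here for the stochastic universal sampling discussed later, the escape criterion is a simple fixed budget, and the fitness function is a toy:

```python
import random

def genetic_search(score, n_bits, pop_size=20, generations=50,
                   cx_prob=0.8, mut_rate=0.05, seed=1):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(pop, key=score)
    for _ in range(generations):                       # Escape? fixed budget
        ranked = sorted(pop, key=score, reverse=True)  # Evaluate
        best = max(best, ranked[0], key=score)
        parents = ranked[: pop_size // 2]              # Select (truncation)
        children = []
        while len(children) < pop_size:                # Reproduce
            a, b = rng.sample(parents, 2)
            if rng.random() < cx_prob:                 # crossover
                pt = rng.randrange(1, n_bits)
                a = a[:pt] + b[pt:]
            a = [bit ^ (rng.random() < mut_rate) for bit in a]  # mutation
            children.append(a)
        pop = children
    return best                                        # Report (final)

# Toy fitness: reward bits at even positions, penalize model size
fit = lambda c: sum(b for i, b in enumerate(c) if i % 2 == 0) - 0.3 * sum(c)
print(genetic_search(fit, n_bits=12))
```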
Initialize
- Clear the environment; initialize parameters
- Create &&VAR&I macro variables from the list of possible parameters
- Evaluate and store the minimum (aka null) model
- Evaluate and store the maximum (aka full) model
- Initialize parents (create the starting generation)
Evaluate
- Individual chromosomes: if a chromosome has a saved score, assign that score to it; otherwise, evaluate the chromosome on its fitness for reproduction
- Save scores for newly evaluated chromosomes
- Generation (of chromosomes): evaluate and store historical information on the characteristics of the generation, e.g., the mean score
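The score-saving step amounts to memoization: each candidate model is fit only once. A minimal Python sketch, with `expensive_fit` as a dummy stand-in for actually fitting a regression model:

```python
# Cache chromosome scores so repeated chromosomes skip the model fit.
score_cache = {}
fit_count = 0

def expensive_fit(chromosome):
    global fit_count
    fit_count += 1                        # count real model fits
    return chromosome.count("1")          # dummy score

def evaluate(chromosome):
    if chromosome not in score_cache:     # only fit unseen chromosomes
        score_cache[chromosome] = expensive_fit(chromosome)
    return score_cache[chromosome]

evaluate("011000000010")
evaluate("011000000010")                  # second call hits the cache
print(fit_count)                          # -> 1
```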
Scores
- Evaluate each chromosome by computing the value of these functions:
- Objective function: the function to be optimized; rewards greater predictive ability while penalizing any increase in the number of parameters (e.g., Akaike's Information Criterion, AIC)
- Fitness function: a function based on the objective function that determines the probability of a chromosome being selected for reproduction
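The slides leave the exact fitness transform open. One common choice (an assumption here, not prescribed by the deck) is the Akaike-weight rescaling exp(-(AIC - best AIC)/2), which turns lower-is-better AIC values into selection probabilities:

```python
import math

def aic_to_fitness(aics):
    """Rescale AIC values (lower is better) into selection probabilities."""
    best = min(aics)
    weights = [math.exp(-(a - best) / 2.0) for a in aics]
    total = sum(weights)
    return [w / total for w in weights]   # probabilities summing to 1

probs = aic_to_fitness([100.0, 102.0, 110.0])
print([round(p, 3) for p in probs])       # -> [0.727, 0.268, 0.005]
```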
SAS® Code Evaluation Illustration

  proc anly-proc data = input-data-set <options>;
    model <dependent> =
      %do i = 1 %to %cntvars.;      /* %cntvars. = p, # of possible parameters */
        %if %substr(&bitstrg., &i., 1) = 1 %then %do;  /* &bitstrg. = chromosome */
          &&var&i..                 /* the i-th candidate variable */
        %end;
      %end;
      </options>;
    <other statements>;
  run;
SAS® Code Comments
- You will very likely create output data sets from the PROC (through ODS statements, OUTPUT statements, or an output option on the MODEL statement) to obtain the statistics that will constitute your objective function and fitness function scores.
- I actually use a modified version of my %ITERLIST macro (Tsai, WUSS 2008) to create the list of variables in the MODEL statement.
Evaluate Escape Criterion
- You need to specify a condition to escape the loop... if you want the algorithm to terminate
- Escape criteria examples:
  - The mean score for a generation fails to exceed any of those for a specified number of immediately preceding generations
  - Failure to surpass the best score seen so far within a specified number of generations
  - Time or resource constraints reached
  - A minimum score surpassed
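The first criterion above can be sketched in Python (illustrative; the window size is a free parameter):

```python
# Escape when the current generation's mean score fails to exceed
# any of the previous `window` generations' means.
def should_escape(mean_scores, window=5):
    if len(mean_scores) <= window:
        return False
    current = mean_scores[-1]
    recent = mean_scores[-window - 1:-1]
    return all(current <= s for s in recent)

history = [1.0, 1.4, 1.6, 1.7, 1.7, 1.7, 1.7, 1.7, 1.65]
print(should_escape(history))             # -> True
```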
Select
- Chromosomes with superior scores are given preference in the selection for reproduction
- The method of selection is at the analyst's discretion; one popular method used in GAs is stochastic universal sampling
Stochastic Universal Sampling
- Uses a single randomly chosen value to sample from the population, choosing chromosomes at evenly spaced intervals across their cumulative fitness scores
- F = sum of the fitness scores for all chromosomes in a generation
- N = number of chromosomes to be selected for reproduction
- The pointer spacing is then F/N
(Wikipedia, 2009)
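A Python sketch of stochastic universal sampling as described: one random draw fixes N evenly spaced pointers across the cumulative fitness F, and each pointer selects the chromosome whose fitness span covers it:

```python
import random

def sus(fitnesses, n, rng=random):
    """Return indices of n chromosomes selected by SUS."""
    F = sum(fitnesses)                 # total fitness of the generation
    step = F / n                       # pointer spacing
    start = rng.uniform(0, step)       # the single random value
    chosen, cum, i = [], 0.0, 0
    for k in range(n):
        pointer = start + k * step
        while cum + fitnesses[i] < pointer:
            cum += fitnesses[i]        # advance to the covering chromosome
            i += 1
        chosen.append(i)
    return chosen

random.seed(0)
print(sus([5.0, 1.0, 3.0, 1.0], n=2))  # -> [0, 3]
```

Because the pointers are evenly spaced, a high-fitness chromosome cannot dominate by luck the way repeated roulette-wheel draws allow.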
Reproduce
- Apply to the selected chromosomes the genetic operations of crossover and mutation
- The resulting chromosomes constitute (in part, and possibly in full) a new generation
Final Report
- Number of generations the algorithm evaluated
- Mean fitness score for each generation
- The best chromosome discovered, with its fitness and objective scores
Disadvantages of Using a GA
- Not built-in SAS functionality
- Many parameters to specify: generation size; crossover probability; mutation rate; objective function / fitness function
- Time-consuming to run
- Still may not find the absolute optimum
Advantages of Using a GA
- Deeper exploration of the model space
- Allows you to remain within a familiar paradigm (regression) with interpretable parameter coefficients
- Agnostic to the regression model chosen: the same macro can be used for any GLM with minor modifications
- "Proven" success in the real world
Suggested Reading
(References in paper)
- Search heuristics: LAR and LASSO heuristics -- Robert Cohen, Peter Flom, and David Cassell
- Information criteria in model selection:
  - Linear regression -- Dennis Beal
  - Logistic and proportional hazards regression -- Ernest Shtatland
  - Mixed models -- Jesse Canchola and Torsten Neilands