Predicting Median Substrate
for Oregon and Washington EMAP sites
Utilizing GIS data
Julia J. Smith
December 12, 2005
Why Predict Median Substrate?
Indicator of overall stream health
• Bed load transport
• Stream power
• Microinvertebrate habitat
• Fish habitat
• How is human development affecting a stream?
What is LD50?
LD50 is a measure of median substrate.
• Geometric mean of class boundaries
• Log10 of the geometric means
• Several samples at each site
• LD50 is the median value of log10(geometric mean of class)
Substrate Classifications
Substrate size (mm)   Class             Geometric mean (mm)   Log10 of geom. mean
8000-4000             Bedrock           5656.85                3.7527
4000-250              Boulders          1000.00                3.0000
250-64                Cobbles            126.49                2.1020
64-16                 Gravel (coarse)     32.00                1.5052
16-2                  Gravel (fine)        5.66                0.7526
2-0.06                Sand                 0.35               -0.4604
0.06-0.001            Fines                0.00775            -2.1109
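The LD50 definition and the class table above can be tied together in a minimal Python sketch. The class boundaries come from the table; the function names and the five-particle site are hypothetical and only illustrate the arithmetic.

```python
import math
from statistics import median

# Class boundaries in mm, copied from the substrate classification table.
CLASS_BOUNDS_MM = {
    "bedrock": (4000, 8000),
    "boulders": (250, 4000),
    "cobbles": (64, 250),
    "coarse_gravel": (16, 64),
    "fine_gravel": (2, 16),
    "sand": (0.06, 2),
    "fines": (0.001, 0.06),
}


def log10_geometric_mean(lower_mm, upper_mm):
    """Log10 of the geometric mean of the two class boundaries."""
    return math.log10(math.sqrt(lower_mm * upper_mm))


def ld50(observed_classes):
    """Median of log10(geometric mean of class) over all samples at a site."""
    scores = [log10_geometric_mean(*CLASS_BOUNDS_MM[c]) for c in observed_classes]
    return median(scores)


# Hypothetical site with five substrate observations (made-up data).
site = ["sand", "fine_gravel", "cobbles", "coarse_gravel", "fine_gravel"]
site_ld50 = ld50(site)
print(round(site_ld50, 4))
```

With an even number of samples the median falls halfway between two class values, which is consistent with intermediate LD50 levels such as 0.146 (halfway between sand and fine gravel) that appear on the plot axes later in the deck.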
Washington EPA Sites for LD50 Study
[Map of sites. LD50 key: -2.11, -0.46, 0.15, 0.75, 1.13, 1.51, 1.80, 2.10, 2.55, 3, 3.75]
Oregon EPA Sites for LD50 Study
[Map of sites. LD50 key: -2.11, -1.29, -0.46, 0.15, 0.75, 1.13, 1.51, 1.80, 2.10, 3, 3.75]
Geomorphic Metrics
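$D_{50} = \frac{\tau}{(\rho_s - \rho)\, g\, \tau_c^*} = \frac{\rho h S}{(\rho_s - \rho)\, \tau_c^*}$ (using $\tau = \rho g h S$)
• $\tau$ is the total bank-full shear stress
• $\rho_s$ is the density of sediment
• $\rho$ is fluid density
• $g$ is gravitational acceleration
• $h$ is bank-full depth
• $S$ is channel slope
• $\tau_c^*$ is critical shear stress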
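As a rough illustration of the D50 relation above, the sketch below evaluates both forms of the formula and converts the result to the log10(mm) scale used for LD50. The default densities, the critical shear stress value of 0.045, and the channel dimensions are assumptions chosen for illustration only; they do not come from the study.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2


def d50_from_shear(tau, rho=1000.0, rho_s=2650.0, tau_c_star=0.045):
    """D50 in meters from total bank-full shear stress tau (Pa).

    rho, rho_s (kg/m^3) and tau_c_star are assumed illustrative defaults.
    """
    return tau / ((rho_s - rho) * G * tau_c_star)


def d50_from_depth_slope(depth_m, slope, rho=1000.0, rho_s=2650.0, tau_c_star=0.045):
    """Same quantity written with tau = rho * g * h * S, so g cancels."""
    return rho * depth_m * slope / ((rho_s - rho) * tau_c_star)


# Hypothetical channel: 0.5 m bank-full depth, 1% slope.
depth_m, slope = 0.5, 0.01
tau = 1000.0 * G * depth_m * slope
d50_mm = d50_from_depth_slope(depth_m, slope) * 1000.0
print(round(d50_mm, 2), round(math.log10(d50_mm), 3))
```

Under these assumed inputs the predicted particle size falls in the cobble class of the substrate table.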
[Figure: Distance-weighted stream power (y-axis) versus LD50 (x-axis, 12 levels from -2.111 to 3.753); r = 0.327, p-value = 2.63 x 10^-12]
Geomorphic Metrics
[Figure: Outlet link mean slope (y-axis) versus LD50 (x-axis); r = 0.214, p-value = 3.78 x 10^-6]
Geologic Metrics
[Figure: Percent unconsolidated geologic type (y-axis) versus LD50 (x-axis); r = -0.246, p-value = 1.18 x 10^-7]
Climatic Metrics
[Figure: Average annual precipitation (y-axis) versus LD50 (x-axis); r = 0.199, p-value = 1.56 x 10^-6]
Climatic Metrics
[Figure: Average annual potential evapotranspiration in mm (y-axis) versus LD50 (x-axis); r = -0.046, p-value = 0.342]
Land Cover Metrics
1. Developed
2. Barren
3. Forest
4. Grasses
5. Agriculture
6. Wetlands
7. Open water/perennial ice and snow
8. Shrubland
Land Cover Metrics
[Figure: Percentage of watershed that is forest (y-axis) versus LD50 (x-axis); r = 0.19, p-value = 3.516 x 10^-5]
Distance-Weighted metrics
$\text{Weighted Area}_j = \frac{A_j\, e^{-\alpha \bar{d}_j}}{\sum_{i=1}^{n} A_i\, e^{-\alpha \bar{d}_i}}$
• $j$ represents the land cover type of concern
• $A_j$ represents the total area for land cover type $j$ in the watershed
• $\alpha$ represents the coefficient of exponential decay
• $\bar{d}_j$ represents the average distance from the outlet for land cover type $j$
• $n$ represents the total number of land cover types
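The distance weighting described above can be sketched as follows, assuming the weight for each land cover type is its area multiplied by an exponential decay in its average distance from the outlet, normalized over all types. The function name, the decay coefficient, and the three-class watershed are hypothetical.

```python
import math


def distance_weighted_fractions(areas, mean_distances, decay):
    """Share of each land cover type after exponential distance weighting.

    areas          : total area per land cover type
    mean_distances : average distance from the outlet per land cover type
    decay          : coefficient of exponential decay (assumed positive)
    """
    weights = {
        cover: areas[cover] * math.exp(-decay * mean_distances[cover])
        for cover in areas
    }
    total = sum(weights.values())
    return {cover: w / total for cover, w in weights.items()}


# Hypothetical watershed (areas in km^2, distances in km); made-up numbers.
areas = {"forest": 60.0, "agriculture": 30.0, "developed": 10.0}
mean_distances = {"forest": 12.0, "agriculture": 5.0, "developed": 1.0}
fractions = distance_weighted_fractions(areas, mean_distances, decay=0.2)
for cover, share in fractions.items():
    print(cover, round(share, 3))
```

In this made-up case forest covers 60% of the watershed by raw area, but because it lies far from the outlet its weighted share drops below that of agriculture and developed land.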
Additional Land Cover Metrics
• Buffered metrics: land cover within a fixed distance of the stream (30 meters, 100 meters, 300 meters)
• Buffered and distance-weighted metrics
Goals
• Predict LD50 without visiting sites
• Small number of predictors for scientifically sensible model
Methods-Stepwise Variable Selection
Multiple Linear Regression
• Top-in-tier models
• Top geomorphic models plus one from each of the remaining tiers
Akaike’s Information Criterion
$AIC = N \log\left(\frac{RSS}{N}\right) + 2(p + 2)$
• $N$ observations
• $p$ predictors
• $RSS$ is the sum of squared residuals
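A small Python sketch of this criterion, assuming the form N log(RSS/N) + 2(p + 2) with a natural logarithm. The two models being compared are invented to show the trade-off between fit and number of predictors.

```python
import math


def aic(rss, n_obs, n_predictors):
    """AIC as N * log(RSS / N) + 2 * (p + 2); natural log is an assumption."""
    return n_obs * math.log(rss / n_obs) + 2 * (n_predictors + 2)


# Hypothetical comparison: three extra predictors barely reduce RSS.
small_model = aic(rss=120.0, n_obs=100, n_predictors=3)
large_model = aic(rss=118.0, n_obs=100, n_predictors=6)
print(round(small_model, 2), round(large_model, 2))
```

Here the smaller model has the lower AIC, so the slight improvement in fit does not justify the additional predictors.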
AIC in stepwise variable selection
Forward stepwise selection: method for choosing the top predictor from each tier
1. Start with the intercept-only model.
2. Choose the variable that reduces AIC the most and include it in the model.
Stepwise selection in both directions: method for choosing all top geomorphic predictors
1. Start with the full model.
2. Add and remove variables until the model with minimum AIC is found or the iteration stops.
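One forward step of the procedure above can be sketched in plain Python. This is an illustrative sketch, not the software used in the study: the least-squares solver, the AIC form N log(RSS/N) + 2(p + 2), the predictor names, and the data are all assumptions.

```python
import math


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def rss(columns, y):
    """Residual sum of squares of an OLS fit of y on an intercept plus columns."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, row))) ** 2
               for row, yi in zip(design, y))


def aic(rss_value, n_obs, n_predictors):
    return n_obs * math.log(rss_value / n_obs) + 2 * (n_predictors + 2)


def forward_step(candidates, y, selected=()):
    """Return (name, aic) of the candidate that lowers AIC the most.

    The name is None when no candidate improves on the current model.
    """
    current = [candidates[name] for name in selected]
    best_name, best_aic = None, aic(rss(current, y), len(y), len(current))
    for name, col in candidates.items():
        if name in selected:
            continue
        trial = aic(rss(current + [col], y), len(y), len(current) + 1)
        if trial < best_aic:
            best_name, best_aic = name, trial
    return best_name, best_aic


# Hypothetical data for eight sites (made-up values).
candidates = {
    "slope": [0.01, 0.02, 0.03, 0.05, 0.08, 0.10, 0.12, 0.15],
    "precip": [1200, 800, 1500, 900, 1100, 1300, 700, 1000],
}
ld50_values = [-0.5, 0.1, 0.2, 0.9, 1.4, 1.6, 2.2, 2.4]
first_pick, first_aic = forward_step(candidates, ld50_values)
print(first_pick)
```

Calling `forward_step` again with `selected=(first_pick,)` continues the search; stepwise selection in both directions would additionally try removing each variable already in the model at every iteration.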
Methods: CART Classification and Regression Trees
[Figure: regression tree for LD50. Root split: DWSP2 < 0.03129; further splits on snow_jan, MENTR, b30_l11, r8_l80_A, b100_l51, prcp_sep, avgt_jun, prcp_may, link_sa4, prcp_jan, b30_r7_l30, mint_apr, and min_elev. The 15 terminal-node predictions range from -1.66 to 2.01.]
Methods: CART Classification and Regression Trees
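Predicted Response:
$\hat{y}(x_i) = \sum_{j=1}^{q} \hat{a}_j \, 1(x_i \in N_j)$
where $N_1, \dots, N_q$ are the terminal nodes of the tree and $\hat{a}_j$ is the fitted value (mean response) in node $N_j$.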
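The predicted response of a regression tree is the mean of the training responses in the terminal node a site falls into. The sketch below shows this for the simplest possible tree, a single split on one predictor chosen to minimize the within-node sum of squares. The data and the predictor are hypothetical, and a full CART fit would apply the same search recursively and then prune.

```python
from statistics import mean


def sse(values):
    """Sum of squared deviations from the node mean."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, left_mean, right_mean) minimizing within-node SSE."""
    xs = sorted(set(x))
    best = None
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2  # candidate threshold between adjacent values
        left = [yi for xi, yi in zip(x, y) if xi < t]
        right = [yi for xi, yi in zip(x, y) if xi >= t]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, t, mean(left), mean(right))
    return best[1:]


def predict(x_new, threshold, left_mean, right_mean):
    """Predicted response: the mean of the node that x_new falls into."""
    return left_mean if x_new < threshold else right_mean


# Hypothetical stream power values and LD50 observations (made-up data).
stream_power = [0.01, 0.02, 0.025, 0.04, 0.05, 0.06]
ld50_obs = [-2.11, -0.46, -0.46, 1.51, 2.10, 1.51]
threshold, left_mean, right_mean = best_split(stream_power, ld50_obs)
print(round(threshold, 4), round(left_mean, 3), round(right_mean, 3))
```

Every site in a node receives the same prediction, so sites with quite different observed substrate sizes can share one predicted value.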
Hybrid of Multiple Linear Regression and CART
1. Run CART on the residuals of the multiple linear regression.
2. Add node indicator variables to the multiple linear regression equation, one for each terminal node in the tree except one (number of terminal nodes minus one).
3. Create a new multiple regression model with the original variables and the indicator variables.
Predictive-ability Statistics
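$PRESS_p = \sum_{i=1}^{n} \left(Y_i - \hat{Y}_{i(i)}\right)^2$
$R^2_{prediction} = 1 - \frac{PRESS_p}{SSTO}$
where $\hat{Y}_{i(i)}$ is the prediction for site $i$ from the model fitted without site $i$, and $SSTO$ is the total sum of squares.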
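The two statistics above can be computed by leave-one-out refitting. The sketch below does this for a one-predictor regression to keep the code short; the study's models have several predictors, and the data here are invented.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def press(x, y):
    """Sum of squared errors when each site is predicted without itself."""
    total = 0.0
    for i in range(len(x)):
        intercept, slope = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        total += (y[i] - (intercept + slope * x[i])) ** 2
    return total


def r2_prediction(x, y):
    """1 - PRESS / SSTO, where SSTO is the total sum of squares of y."""
    mean_y = sum(y) / len(y)
    ssto = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - press(x, y) / ssto


# Hypothetical slopes and LD50 values for eight sites (made-up data).
slope_values = [0.01, 0.02, 0.03, 0.05, 0.08, 0.10, 0.12, 0.15]
ld50_values = [-0.5, 0.1, 0.2, 0.9, 1.4, 1.6, 2.2, 2.4]
r2_pred = r2_prediction(slope_values, ld50_values)
print(round(r2_pred, 3))
```

Because each site is predicted by a model that never saw it, this statistic is generally lower than the ordinary R² of the same model and gives a more honest picture of predictive ability.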
Analysis Comparison – Top 4-tier Models
Problems with top 4-tier models
• Low adjusted R²
• Low predictive ability
• Over-prediction and under-prediction of fine and bedrock substrate
• Non-normal residuals
Benefit of top 4-tier models
• Small number of predictors
Example of Non-normality of Residuals: Top 4-Tier Model
[Figure: normal Q-Q plot of residuals; sample quantiles versus theoretical quantiles]
Analysis Comparison – Geomorphic plus Top 3-Tier Models
Problems with top geomorphic plus top 3-tier model
• Increase in number of variables
• Predictive ability still low
• Over-prediction and under-prediction of fine and bedrock substrate
• Some collinearity between variables
Benefits of top geomorphic plus top 3-tier model
• Improved predictions
• Improved normality of residuals
Comparison of Analysis - CART
Problems with CART
• Low predictive ability
• Predicts several observed substrate sizes in one node
• Over-prediction and under-prediction of fines and bedrock substrate
• Omitting one site creates a different tree
Benefits of CART
• Simple analysis
• Missing variables not an issue
CART Predictions
[Figure: LD50 CART predictions (y-axis) versus observed LD50 values (x-axis)]
Comparison of Analysis - Hybrids
Problems with hybrid models
• Increased number of variables
• Collinearity with introduction of node indicator variables
• Non-normal residuals
Benefits of hybrid models
• Residuals closer to normal
• Increased predictive ability
• Explains some of the variation created by fitting a linear model to ordinal data
Most promising multiple regression prediction model: Geomorphic plus top 3-tier

Response   Adjusted R²   PRESSp for LD50   MSPR    R² prediction
LD50       0.362         504.802           1.274   0.319
One example: Residual Tree for Hybrid Geomorphic plus Top 3-Tier Model
[Figure: regression tree fitted to the residuals. Root split: slp_elon < 0.3566; further splits on out_sa, CVENTR, link_slope, topo_wet, shed_slp, link_sa, b30_r5_l42, CVCON, avgt_jun, and MENTB. The 18 terminal-node values range from -1.1 to 0.8367.]
One example: Observed vs. Predicted for Hybrid Geomorphic plus Top 3-Tier Model
[Figure: cross-validation LD50 predictions (y-axis) plotted against observed LD50 values (x-axis)]
QQ-Plot of Residuals for Hybrid Model
[Figure: normal Q-Q plot of residuals; sample quantiles versus theoretical quantiles]
Coast Range Ecoregion
• Less skewed distribution of LD50
• No measurements are outliers
• Similar ecosystem throughout region
Ecoregion Distributions
[Figure: distribution of LD50 (x-axis, -3 to 3) by level 3 ecoregion: Blue Mountains, Cascades, Coast Range, Colorado Plateau, Columbia Plateau, Eastern Cascades Slopes and Foothills, Klamath Mountains, North Cascades, Northern Basin and Range, Northern Rockies, Puget Lowland, Snake River Plain, Willamette Valley]
Coast Range EMAP Sites
[Map of sites. LD50 key: -2.11, -1.29, -0.46, 0.75, 1.13, 1.51, 1.80, 2.10, 3, 3.75]
Top 4-Tier Coast Range Model
Predictors
• Average aspect (climatic)
• Average watershed elevation (geomorphic)
• % watershed as volcanic geologic type (geologic)
• % wetlands (distance weighted and buffered)
QQ-Plot: Top 4-Tier Coast Range
[Figure: normal Q-Q plot of residuals; sample quantiles versus theoretical quantiles]
Observed versus Predicted: Top 4-Tier Coast Range Model
[Figure: cross-validated LD50 predictions (y-axis) versus observed LD50 (x-axis)]
Coast Range Model: Top Geomorphic Variables
1. Average watershed elevation (m)
2. Drainage density
3. Mean slope within a 300-meter buffer
4. Ratio of width of stream to width of floodplain
5. Coefficient of average hill connectivity
6. Distance to the first tributary (m)
7. Percent of landscape with less than 4% slope
8. Percent of landscape with less than 7% slope
9. Measure of size and complexity of river
10. Percent of stream as cascade
11. Distance-weighted stream power
12. Watershed relief divided by its length
QQ-Plot: Coast Range Geomorphic plus Top 3-Tier Model
[Figure: normal Q-Q plot of residuals; sample quantiles versus theoretical quantiles]
Observed versus Predicted: Coast Range Geomorphic + Top 3-Tier
[Figure: cross-validation LD50 predictions (y-axis) versus observed LD50 (x-axis)]
CART - Coast Range Ecoregion
[Figure: predictions versus observed LD50; CART predicted LD50 values (y-axis) versus observed LD50 values (x-axis)]
Coast Range: Hybrid Models
Benefits of hybrid
• Improved prediction
• Improved fit
• Improved normality of residuals
Problems with hybrid
• Increased number of predictors
• Collinearity with node indicator variables
QQ-Plot: Coast Range Hybrid Top 4-Tier
[Figure: normal Q-Q plot of residuals; sample quantiles versus theoretical quantiles]
Observed versus Predicted: Coast Range Hybrid Top 4-Tier
[Figure: cross-validation LD50 predictions (y-axis) versus observed LD50 values (x-axis)]
QQ-Plot: Coast Range Hybrid Geomorphic plus Top 3-Tier
[Figure: normal Q-Q plot of residuals; sample quantiles versus theoretical quantiles]
Observed versus Predicted: Coast Hybrid Geomorphic plus Top 3-Tier
[Figure: cross-validation LD50 predictions (y-axis) versus observed LD50 (x-axis)]
Comparison of Coast Models
Model                          Adjusted R²   R² prediction
Top 4-tier                     0.384         0.362
Geomorphic plus top-3          0.548         0.495
CART                           NA            0.087
Top 4-tier hybrid              0.552         0.503
Geomorphic plus top-3 hybrid   0.700         0.614
Conclusions
• LD50 is difficult to predict
• Additional geomorphic predictors increase prediction ability
• Hybrid models increase prediction ability
• More success in Coast Range Ecoregion
Future Work
Logistic Regression
• Ordinal data treated as continuous in this study
• 12 categories might require more sophisticated methods
Spatial Analysis
• Appears to be spatial correlation in distribution of LD50