Statistics CourseWork
Uploaded by nedum-eluwa
Contents
List of Tables
List of Figures
Part One
  1.1. Daily Rainfall
  1.2. Monthly Rainfall
  1.3. Annual Rainfall
Part Two
Part Three
References
List of Tables
Table 1: Estimated Percentile values for Exponential Distribution (μ = 4.1)
Table 2: Estimations of Distribution Fits for Daily Rainfall Data
Table 3: Estimated Percentile values for Gamma Distribution (α = 2.8; β = 25.5)
Table 4: Estimated Percentile values for Normal Distribution (μ = 851; σ = 117)
Table 5: Model Summary of SAAR vs Elevation Regression
Table 6: Coefficients of SAAR vs Elevation Regression
Table 7: Best combinations of SAAR regression variables
Table 8: Model Summary of revised SAAR regression
Table 9: Coefficients of revised SAAR regression model
Table 10: Model application results from ungauged site
Table 11: Trend model details and uncertainty parameters
List of Figures
Figure 1: Top - Histogram of Daily Rainfall
Figure 2: Top - Histogram of Monthly Rainfall
Figure 3: Top - Histogram of Annual Rainfall
Figure 4: Scatterplot of Regression relation (SAAR versus Elevation)
Figure 5: Residual Plots of SAAR Regression with Elevation
Figure 6: Residual plots of refined SAAR regression model
Figure 7: Correlogram of Daily Rainfall in the Eden Catchment (30-day lag)
Figure 8: Sample time series plot of Daily Rainfall Data
Figure 9: Autocorrelation (12 month lag) plot for Monthly Rainfall in the Eden Catchment
Figure 10: Autocorrelation (10 year lagged) of Annual rainfall in the Eden Catchment
Figure 11: Autocorrelation (12 month lagged) for mean monthly flows
Figure 12: Correlogram of deseasonalised monthly flow data (12 month lagged)
Figure 13: Mean Annual Temperatures for Central England for 1659 - 2011
Figure 14: Mean Annual Temperature of Central England (1701 - 1800)
Figure 15: Mean Annual Temperature of Central England (1801 - 1900)
Figure 16: Mean Annual Temperature of Central England (1901 - 2000)
Part One
This part examines the probability distributions used to estimate the likelihood of random events. Precipitation is stochastic in nature, which makes it a clear example of a process that must be estimated from an established probability distribution.
The objective is to analyse rainfall frequencies from the Eden Catchment over a 40-year period and to determine the general shape of the probability density curve at each time scale. For each curve, the properties and the resulting estimated catchment parameters are also presented.
1.1. Daily Rainfall
The plot of relative frequencies of daily rainfall (Figure 1) shows that the exponential distribution estimates daily rainfall depth reasonably well. The fitted cumulative distribution function also follows the observed cumulative frequencies.
The fitted distribution has an approximate mean (μ) of 4.1 mm. Over the record length there were 8239 wet days, with rainfall depths between 0.1 mm and 79 mm. This wide range complicates the identification of a distribution for daily rainfall, especially for depths close to zero. A first comparison in Minitab of the candidate distributions (Normal, Exponential, 3-parameter Lognormal and Gamma) produced Table 2.
No candidate distribution fitted the data closely (all reported p-values are below 0.005). However, raising the calculation threshold to exclude depths of 0.4 mm or less narrowed the range of values and reduced the Anderson-Darling (AD) statistic most sharply for the exponential distribution (from 238.2 to 58.7). Visual comparison supported this selection.
Table 1: Estimated Percentile values for Exponential Distribution (μ = 4.1)
P(X ≤ x)                  x (mm)
0.10 (10th Percentile)    0.4
0.50 (50th Percentile)    2.8
0.90 (90th Percentile)    9.4
0.99 (99th Percentile)    18.7

Table 2: Estimations of Distribution Fits for Daily Rainfall Data
                         No Threshold (All non-zero values)    Threshold (Values > 0.4mm)
Distribution             AD         P          LRT P           AD         P          LRT P
Normal                   675.409    < 0.005                    511.455    < 0.005
Exponential              238.213    < 0.003                    58.677     < 0.003
3-Parameter Lognormal    60.101     *          0.000           35.517     *          0.000
Gamma                    54.904     < 0.005                    67.735     < 0.005
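The percentiles in Table 1 follow from inverting the exponential cumulative distribution function, x = -mean × ln(1 - p). A minimal Python sketch, assuming the rounded mean of 4.1 mm quoted in the text (the function name is illustrative; small differences from Table 1 are to be expected if the table was computed from an unrounded mean):

```python
import math

def exponential_percentile(mean_mm, p):
    """Depth x such that P(X <= x) = p for an exponential distribution."""
    return -mean_mm * math.log(1.0 - p)

MEAN_DAILY_MM = 4.1  # rounded mean wet-day rainfall quoted in the text

for p in (0.10, 0.50, 0.90, 0.99):
    x = exponential_percentile(MEAN_DAILY_MM, p)
    print(f"P(X <= x) = {p:.2f}: x = {x:.1f} mm")
```

The same function can be used to read off any other non-exceedance probability, provided the exponential fit is accepted as adequate.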
Figure 1: Top - Histogram of Daily Rainfall; Bottom Cumulative Distribution Plot
1.2. Monthly Rainfall
The monthly rainfall record for the Eden catchment contains 469 observations ranging between 0.9 mm and 228.5 mm. The frequency distribution and cumulative distribution plot are shown in Figure 2. The best-fitting distribution was selected using the same procedure as for daily rainfall. Of the candidate distributions tested, the Gamma distribution fitted the monthly data best (AD = 0.472; p > 0.250). The shape (α) and scale (β) of the Gamma distribution are approximately 2.8 and 25.5 respectively. These values combine (αβ) to give a mean monthly rainfall over the record length of 71.2 mm and an approximate standard deviation, (αβ²)^0.5, of 43 mm.
Table 3: Estimated Percentile values for Gamma Distribution (α = 2.8; β = 25.5)
P(X ≤ x)                  x (mm)
0.10 (10th Percentile)    24.9
0.50 (50th Percentile)    62.9
0.90 (90th Percentile)    128.3
0.99 (99th Percentile)    205.2
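The mean and standard deviation quoted for the monthly data follow from the Gamma shape and scale: the mean is shape × scale and the standard deviation is √shape × scale. A short Python check, assuming the rounded parameters 2.8 and 25.5 (the function name is illustrative). With these rounded inputs the result is about 71.4 mm and 42.7 mm, close to the 71.2 mm and 43 mm reported, which were presumably derived from unrounded estimates:

```python
import math

def gamma_mean_sd(shape, scale):
    """Mean and standard deviation of a Gamma(shape, scale) distribution."""
    return shape * scale, math.sqrt(shape) * scale

# Rounded shape and scale quoted in the text
mean_mm, sd_mm = gamma_mean_sd(2.8, 25.5)
print(f"mean = {mean_mm:.1f} mm, sd = {sd_mm:.1f} mm")
```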
1.3. Annual Rainfall
As expected, the annual rainfall approximately followed a normal distribution. The fitted curve has a mean (μ) of approximately 851 mm and a standard deviation (σ) of approximately 117 mm for 39 years of record. The best-fit selection followed the same procedure as above.
Table 4: Estimated Percentile values for Normal Distribution (μ = 851; σ = 117)
P(X ≤ x)                  x (mm)
0.10 (10th Percentile)    701.2
0.50 (50th Percentile)    851.3
0.90 (90th Percentile)    1001.4
0.99 (99th Percentile)    1123.7
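The annual percentiles can be reproduced approximately from the fitted normal distribution. The sketch below uses Python's standard-library `statistics.NormalDist` with the rounded mean and standard deviation from the text, so the values differ slightly from Table 4:

```python
from statistics import NormalDist

# Rounded mean and standard deviation (mm) quoted in the text
annual = NormalDist(mu=851, sigma=117)

for p in (0.10, 0.50, 0.90, 0.99):
    print(f"P(X <= x) = {p:.2f}: x = {annual.inv_cdf(p):.1f} mm")
```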
Figure 2: Top - Histogram of Monthly Rainfall; Bottom - Cumulative Distribution Plot of Monthly Rainfall
Figure 3: Top - Histogram of Annual Rainfall; Bottom - Cumulative Distribution Plot of Annual Rainfall (top panel: Histogram of Annual Rainfall (mm) with Normal Distribution Fit; x-axis: Annual Rainfall (mm); y-axis: Frequency)
Part Two
Part One highlighted the variation of rainfall at different temporal scales. This part defines relationships between Standard Annual Average Rainfall (SAAR) and geospatial variables in the Eden Catchment. Regression analysis is used to test the dependence, and the resulting model is used to predict SAAR (within stated margins of uncertainty) at a specific location.
The predictor variables given for the analysis are Elevation (Elev), Easting (E) and Northing (N). A first look at the catchment's Digital Elevation and Interpolated Annual Rainfall maps suggests some correlation between elevation and rainfall, especially in the lower-lying areas of the catchment (see Appendix). Spatial variation is also evident in the steady decrease in rainfall towards the north, although the east-west pattern is harder to judge visually.
Initial regression of SAAR with Elevation produced the following model:
Equation 1: Regression Model of SAAR with Elevation
SAAR (mm) = 523.8 + 2.565 Elevation (m)
The model results presented in Table 5 and Table 6 show reasonable prediction of SAAR, with comparatively small standard errors in the coefficients (R² = 0.716; p < 0.005).
Table 5: Model Summary of SAAR vs Elevation Regression
S (mm)    R²        R² (adjusted)    PRESS      R² (predictive)
194.2     71.60%    70.46%           1143016    65.55%

Table 6: Coefficients of SAAR vs Elevation Regression
Term             Coef     SE Coef    95% CI            T-Value    P-Value
Constant         523.8    90.7       (337.1, 710.5)    5.78       0.000
Elevation (m)    2.565    0.323      (1.900, 3.231)    7.94       0.000
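The T-Value column in Table 6 is each coefficient divided by its standard error. A small Python check using the tabulated values (the dictionary and function names are illustrative):

```python
# (coefficient, standard error) pairs copied from Table 6
terms = {
    "Constant": (523.8, 90.7),
    "Elevation (m)": (2.565, 0.323),
}

def t_value(coef, se):
    """t statistic for testing whether a regression coefficient is zero."""
    return coef / se

for name, (coef, se) in terms.items():
    print(f"{name}: t = {t_value(coef, se):.2f}")
```

Large absolute t-values, as here, correspond to the very small p-values reported for both terms.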
Although the histogram of the residuals (Figure 5) does not appear normal on visual inspection, the Anderson-Darling test does not reject normality at the 5% significance level (mean of residuals = -0.01; AD = 0.483; p = 0.212). The small sample size may explain this apparent contradiction, as small samples usually pass statistical normality tests (Machiwel & Jha, 2012). Nevertheless, the normal probability plot shows points clustered about the normal line. The assumption that the residuals follow a normal distribution is thus taken as satisfied.
The residuals also show a random pattern about the centre line (with no clustering) when plotted in order of observation. This satisfies the assumption that the residuals are not correlated with one another.
The plot of residuals against fitted values shows variance increasing from left to right. This is evidence of non-constant variance and violates the assumption of homoscedasticity.
Figure 4: Scatterplot of Regression relation (SAAR versus Elevation) (x-axis: Elevation (m); y-axis: Standard Annual Average Rainfall (mm))
This violation affects the validity of the model, which may therefore require refinement, either by transforming the response variable or by including other predictor variables.
The variables were then combined in Minitab to select the optimal set of terms (high R², significant p-values, low errors and few variables), producing the following summary table:
Table 7: Best combinations of SAAR regression variables
No of Variables    R²      R² (adjusted)    R² (predictive)    Mallows Cp    S (mm)    Variable Combination (Elevation (m), Easting, Northing)
1                  71.6    70.5             65.6               10.6          194.16    X (p=0.000)
1                  54.7    52.9             46.3               30.5          245.08    X (p=0.000)
2                  77.6    75.8             67.8               5.4           175.79    X (p=0.000) X (p=0.018)
2                  72.6    70.3             65.5               11.4          194.71    X (p=0.000) X (p=0.363)
All                80.5    78.0             66.4               4.0           167.56    X (p=0.000) X (p=0.077) X (p=0.005)
The results (Table 7) show that elevation (orographic uplift or cloud seeding) is the major physical determinant of rainfall for this catchment: the model with Elevation alone explains 71.6% of the variation in SAAR. Identifying the predominant physical process statistically helps in interpreting the other physical effects that determine wet and dry areas within the catchment.
The results also show that the model with all three variables fits the responses reasonably well. The added terms improve the model's ability to fit the responses (adjusted R² = 0.78). Elevation remains highly significant in the full model (p = 0.000) and Northing is also significant (p = 0.005); only Easting is not significant at the 5% level (p = 0.077).
This may suggest that the Easting variable adds little to the model, and that similar responses in SAAR could be reproduced without it. Indeed, the model which predicts SAAR from Elevation and Northing alone has a higher predictive R² (67.8% versus 66.4%). Nevertheless, the full model is preferred as a trade-off, because it has the better fit (lower Mallows Cp, 4.0 versus 5.4) and a smaller standard error of the regression (S = 167.56 mm versus 175.79 mm). It may thus be concluded that the regression of SAAR on Elevation, Easting and Northing is practical enough to be used for subsequent predictions.
Figure 5: Residual Plots of SAAR Regression with Elevation
The revised regression produced the following model, summarized in Table 8:
Equation 2: Revised Regression Model of SAAR
SAAR (mm) = 12633 + 2.142 Elevation (m) - 0.009 Easting - 0.017 Northing

Table 8: Model Summary of revised SAAR regression
S (mm)     R²        R² (adjusted)    PRESS      R² (predictive)
167.559    80.54%    78.00%           1115354    66.39%

Table 9: Coefficients of revised SAAR regression model
Term             Coef        SE Coef    95% CI                  T-Value    P-Value    VIF
Constant         12633       3758       (4859, 20407)           3.36       0.003
Elevation (m)    2.142       0.388      (1.339, 2.945)          5.52       0.000      1.94
Easting          -0.00900    0.00487    (-0.01907, 0.00107)     -1.85      0.077      1.52
Northing         -0.01684    0.00549    (-0.02820, -0.00548)    -3.07      0.005      1.88
Table 9 shows that average annual rainfall within the catchment increases (positive coefficient) towards higher elevations but decreases (negative coefficients) in the northward and eastward directions. Prior understanding of the predominant effect of orographic uplift (or cloud seeding) within the catchment, together with visual inspection of the catchment maps, assists in detecting physical patterns. The catchment maps show highland areas on the southern and eastern boundaries, with lower-lying areas towards the north. Comparing the average rainfall map with the DEM map (see Appendix), the low-lying northern reaches of the catchment receive less rainfall. However, even the highlands on the eastern boundary receive significantly less rainfall. This agrees with the model predictions and can be interpreted as a rain shadow effect, caused by the highlands in the south-west sheltering the east from the rain-laden, predominantly south-westerly winds (Pollock, et al., 2013). The intense rainfall in the eastern highlands shown on the given 2005 rainfall map may be due to enhanced cloud seeding during a convective storm.
Figure 6: Residual plots of refined SAAR regression model
Figure 6 shows residual plots which test the validity of the refined linear model. The residuals are consistent with a normal distribution (mean = -0.0000; AD = 0.283; p = 0.607) and, as in the first model, show a random pattern when plotted against record order, which gives evidence that the errors are not correlated with one another. In addition, the VIF values in Table 9 (all below 2) indicate little multicollinearity among the predictors.
However, the residuals in this revised model still show evidence of non-constant variance. This may be due to missing variables in the model. From previous analysis, direction of slope (aspect) combined with elevation may give better predictions of SAAR in the catchment.
When this model is applied to an ungauged site, its shortfalls must therefore be taken into consideration, as predictions are accurate only if the model represents the true relationship. For such a site, with predictor variables Elevation = 400 m, Easting = 380000 and Northing = 500000, SAAR can be estimated as follows:
Table 10: Model application results from ungauged site
Estimated SAAR    SE Fit    95% CI              95% PI
1648.3 mm         65.1      (1513.7, 1782.8)    (1276.4, 2020.1)
From Table 10, the predicted SAAR at the site is 1648.3 mm. The 95% confidence interval indicates that the true mean (expected value) of SAAR for this set of predictor values is estimated, with 95% confidence, to lie between 1513.7 mm and 1782.8 mm. The 95% prediction interval, on the other hand, gives the range expected to contain a single observed SAAR value at such a site. This interval is wider because it must account for the scatter of individual values about the mean as well as the uncertainty in the mean itself. Therefore, even if the model correctly represents the expected response for a given set of predictor values, its estimate of any particular response for the same values is at best a crude one.
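The point estimate and the relative widths of the two intervals can be sketched in Python. The coefficients are taken from Table 9, S from Table 8 and SE Fit from Table 10; the variable names are hypothetical, and the t multiplier is backed out from the reported confidence interval rather than taken from Minitab, so this is an approximate check, not the original calculation. Because the coefficients are rounded, the point estimate comes out near 1649.8 mm instead of the 1648.3 mm reported.

```python
import math

# Coefficients from Table 9 (hypothetical variable names)
INTERCEPT, B_ELEV, B_EAST, B_NORTH = 12633.0, 2.142, -0.00900, -0.01684

def predict_saar(elev_m, easting, northing):
    """Point estimate of SAAR (mm) from the revised regression model."""
    return INTERCEPT + B_ELEV * elev_m + B_EAST * easting + B_NORTH * northing

fit = predict_saar(400, 380000, 500000)

# Values reported in Table 8 (S) and Table 10 (SE Fit, 95% CI)
S, SE_FIT = 167.559, 65.1
ci_low, ci_high = 1513.7, 1782.8

# Assumption: recover the t multiplier implied by the reported confidence interval
t_mult = (ci_high - ci_low) / 2 / SE_FIT

# A prediction interval adds the residual scatter S to the uncertainty of the mean
pi_half_width = t_mult * math.sqrt(S**2 + SE_FIT**2)

print(f"point estimate = {fit:.1f} mm")
print(f"PI half-width  = {pi_half_width:.1f} mm")
```

The resulting half-width is close to half the width of the prediction interval in Table 10, which illustrates why that interval is so much wider than the confidence interval.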
Part Three
Part Three focuses on the temporal relationship of events with themselves and with one another. The temporal focus aids understanding of specific processes by investigating behaviour through time. This understanding is crucial in decision making, optimized engineering design and accurate prediction, because future events depend on past and present events.
Statistical tools assist the detection of time-based patterns in data and provide methods of analysis. One such tool is autocorrelation, which investigates the inherent memory or influence of a process on itself (Machiwel & Jha, 2012). Figure 7 shows a correlogram of daily rainfall data from the Eden catchment, with lags of up to 30 days, to understand variation within a month. The correlogram shows strong autocorrelation at short lags, which remains significant for a few days. This correlation is observed physically as the tendency for events to persist. Figure 8 shows clear evidence of this persistence: the red ovals highlight dry days following dry days and wet days following wet days.
Figure 7: Correlogram of Daily Rainfall in the Eden Catchment (30-day lag) (Autocorrelation Function for Daily Rainfall, with 5% significance limits for the autocorrelations; x-axis: Lag; y-axis: Autocorrelation)
Figure 8: Sample time series plot of Daily Rainfall Data showing persistence (autocorrelation) in rainfall data. (x-axis: Time Step Index (Days); y-axis: Rainfall Depth (mm))
Monthly rainfall also shows strong autocorrelation with the immediately succeeding month (Figure 9). However, this effect wanes significantly after one month. This persistence, or influence of antecedent conditions, is not evident in the autocorrelation of annual rainfall at lags of up to 10 years (Figure 10). This is usually due to changes in the environment and the dissipation of physical inertia over time. Rainfall generally responds quickly to alterations in atmospheric conditions. It is therefore likely to influence subsequent events in its time series only for a relatively short period, as antecedent conditions vary rapidly.
Figure 9: Autocorrelation (12 month lag) plot for Monthly Rainfall in the Eden Catchment (Autocorrelation Function for Monthly Rainfall, with 5% significance limits for the autocorrelations; x-axis: Lag; y-axis: Autocorrelation)
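A correlogram plots the sample autocorrelation of a series against lag. The Python sketch below shows one common estimator; the rainfall values are made up for illustration, and the ±2/√n limits are a widely used approximation for an uncorrelated series, not necessarily the exact limits Minitab draws.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a sequence at a given positive lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    num = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return num / denom

# Made-up daily rainfall depths (mm), for illustration only
rain = [0.0, 0.0, 5.2, 8.1, 3.0, 0.0, 0.0, 0.0, 1.2, 4.5, 6.3, 0.0]

# Approximate 5% significance limits for an uncorrelated series
limit = 2 / len(rain) ** 0.5

for k in (1, 2, 3):
    r = autocorrelation(rain, k)
    flag = "significant" if abs(r) > limit else "not significant"
    print(f"lag {k}: r = {r:.2f} ({flag})")
```

Persistence of the kind described above shows up as positive values at the first few lags that fall outside the limits.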
This dependence on antecedent conditions is also demonstrated in the correlogram for the stream flow time series (Figure 11). High flows tend to follow high flows, and low flows tend to follow low flows.
Because time series usually combine several complex and intricately correlated components, one component can sometimes mask the detection of another. This masking prevents proper understanding of the masked component, which may be crucial to overall insight into the behaviour of the time series. A clear example is the effect of the seasonality component on the trend component.
Stream flows are known to follow seasonal patterns of high and low flows. However, other factors which are not seasonal, such as land use variations, may affect stream flow. It is therefore necessary to strip the stream flow series of its seasonality component to determine the significance of stream flow variation caused by other factors.
Figure 10: Autocorrelation (10 year lagged) of Annual rainfall in the Eden Catchment. (Autocorrelation Function for Annual Rainfall, with 5% significance limits for the autocorrelations; x-axis: Lag; y-axis: Autocorrelation)
Figure 11: Autocorrelation (12 month lagged) for mean monthly flows at Eden Sheepmouth (1970 - 2000) (Autocorrelation Function for Monthly Flow, with 5% significance limits for the autocorrelations; x-axis: Lag Time (12 months); y-axis: Autocorrelation)
This process of stripping is called deseasonalisation. To achieve this, the difference between each observed value and the mean for its calendar month is standardized using the standard deviation for that calendar month. This removes the seasonal cycle so that the remaining month-to-month variation can be examined on its own. The formula used for deseasonalising the data is:
$$z = \frac{x - \bar{x}_m}{s_m}$$
where $x$ = observed flow for the month; $\bar{x}_m$ = mean for the calendar month; $s_m$ = standard deviation for the calendar month; $m$ = calendar month in question.
The resulting correlogram in Figure 12 shows persistence extending only to adjacent months.
Figure 12: Correlogram of deseasonalised monthly flow data (12 month lagged) (Autocorrelation Function for Deseasonalised Monthly Flows, with 5% significance limits for the autocorrelations; x-axis: Lag; y-axis: Autocorrelation)
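The standardisation described above can be sketched in Python as follows. The flows are made-up values for two calendar months over three years, the function name is illustrative, and the sample standard deviation (n - 1 divisor) is an assumption, since the text does not say which form was used.

```python
from statistics import mean, stdev

def deseasonalise(flows, months):
    """Standardise each flow by the mean and standard deviation of its calendar month."""
    by_month = {}
    for flow, month in zip(flows, months):
        by_month.setdefault(month, []).append(flow)
    # Each calendar month needs at least two values for a sample standard deviation
    stats = {m: (mean(v), stdev(v)) for m, v in by_month.items()}
    return [(flow - stats[m][0]) / stats[m][1] for flow, m in zip(flows, months)]

# Made-up mean monthly flows for January (1) and July (7) in three successive years
flows = [30.0, 12.0, 36.0, 10.0, 24.0, 14.0]
months = [1, 7, 1, 7, 1, 7]

print([round(z, 2) for z in deseasonalise(flows, months)])
```

After this step every calendar month has zero mean and unit standard deviation, so any autocorrelation that remains is not produced by the seasonal cycle.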
When seasonality is understood and addressed, it becomes possible to view trends. Trend analysis is central to forecasting and projections, and therefore to the decision-making processes that rely on them. Because forecasting and projection models are good only if they represent the true behaviour of the system, the (partial duration) time series used to detect a general trend must be representative of the entire system. This property is called ergodicity. The difficulty of obtaining a representative time series is primarily due to limits on record length. Behaviour which precedes the first available records can only be crudely guessed, while behaviour (trends) which follows the record can be predicted within reasonable uncertainty limits.
Figure 13: Mean Annual Temperatures for Central England for 1659 - 2011
Figure 14: Mean Annual Temperature of Central England (1701 - 1800)
The importance of record length to developing decision support systems is illustrated clearly in Figure 13, which shows mean annual temperature in Central England from 1659 to 2011. This temperature series has been split into three century-long partial duration series (Figure 14, Figure 15 and Figure 16).
Each partial series exhibits its own trend, applicable only within its record length, which does not conform to the overall trend of the entire series. This highlights the danger of extrapolating outside the range of predictor values. It is therefore imperative to understand the uncertainty associated with the record period available and to calibrate decision support models to reflect such unknowns.
The general upward trend of mean annual temperatures displays the non-homogeneity of the mean. This may be due to changes in the method of data collection, in the environment, or both (Machiwel & Jha, 2012). Variations in the environment due to climate change are possible causes of this non-homogeneity.
Table 11: Trend model details and uncertainty parameters
Record Length      Trend Equation           Mean Absolute Percentage Error    Mean Absolute Deviation    Mean Squared Deviation
1701 - 1800        Y(t) = 9.31 - 0.003t     4.9%                              0.44                       0.34
1801 - 1900        Y(t) = 9.10 + 0.00036t   5.5%                              0.49                       0.38
1901 - 2000        Y(t) = 9.16 + 0.007t     4.1%                              0.39                       0.24
Complete Series    Y(t) = 8.76 + 0.003t     5.3%                              0.48                       0.37
It must be emphasized, nonetheless, that the errors shown in Table 11 apply only within the record length supplied to Minitab. When extrapolating from one time window to the next, additional error terms must therefore be included, which increases the uncertainty.
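The effect of extrapolating a window's trend beyond its own record can be illustrated with the equations in Table 11. The Python sketch below assumes that t = 1 is the first year of each window, so that t = 100 is its last year; this indexing is an assumption about how the trend models were fitted, and the names are illustrative.

```python
# (intercept, slope) of each fitted trend Y(t) = a + b*t, from Table 11
trends = {
    "1701-1800": (9.31, -0.003),
    "1801-1900": (9.10, 0.00036),
    "1901-2000": (9.16, 0.007),
}

def trend_value(window, t):
    """Fitted mean annual temperature at time index t for a given window."""
    a, b = trends[window]
    return a + b * t

# Assumption: t = 1 is the first year of each window
extrapolated_2000 = trend_value("1701-1800", 300)  # 1701-1800 trend pushed to the year 2000
fitted_2000 = trend_value("1901-2000", 100)        # trend fitted to 1901-2000 itself

print(f"extrapolated: {extrapolated_2000:.2f}, fitted: {fitted_2000:.2f}")
```

Under this assumption the two estimates for the same year differ by well over one unit of temperature, far more than the mean absolute deviations listed in Table 11, which is the point made above about errors applying only within each record length.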
Figure 15: Mean Annual Temperature of Central England (1801 - 1900)
Figure 16: Mean Annual Temperature of Central England (1901 - 2000)
References
Machiwel, D. & Jha, M., 2012. Hydrologic Time Series Analysis. New Delhi: Capital Publishing Company.
Pollock, M. et al., 2013. World Meteorological Organisation. [Online] Available at: http://www.wmo.int/pages/prog/www/IMOP/publications/IOM-116_TECO-2014/Session%203/O3_9_Pollock_Accurate_Rainfall_measurement.pdf [Accessed 9 December 2014].