Statistics CourseWork

Contents

List of Tables
List of Figures
Part One
1.1. Daily Rainfall
1.2. Monthly Rainfall
1.3. Annual Rainfall
Part Two
Part Three
References

List of Tables

Table 1: Estimated Percentile values for Exponential Distribution (mean = 4.1)
Table 2: Estimations of Distribution Fits for Daily Rainfall Data
Table 3: Estimated Percentile values for Gamma Distribution (shape = 2.8; scale = 25.5)
Table 4: Estimated Percentile values for Normal Distribution (mean = 851; standard deviation = 117)
Table 5: Model Summary of SAAR vs Elevation Regression
Table 6: Coefficients of SAAR vs Elevation Regression
Table 7: Best combinations of SAAR regression variables
Table 8: Model Summary of revised SAAR regression
Table 9: Coefficients of revised SAAR regression model
Table 10: Model application results from ungauged site
Table 11: Trend model details and uncertainty parameters

List of Figures

Figure 1: Top - Histogram of Daily Rainfall
Figure 2: Top - Histogram of Monthly Rainfall
Figure 3: Top - Histogram of Annual Rainfall
Figure 4: Scatterplot of Regression relation (SAAR versus Elevation)
Figure 5: Residual Plots of SAAR Regression with Elevation
Figure 6: Residual plots of refined SAAR regression model
Figure 7: Correlogram of Daily Rainfall in the Eden Catchment (30-day lag)
Figure 8: Sample time series plot of Daily Rainfall Data
Figure 9: Autocorrelation (12 month lag) plot for Monthly Rainfall in the Eden Catchment
Figure 10: Autocorrelation (10 year lagged) of Annual rainfall in the Eden Catchment
Figure 11: Autocorrelation (12 month lagged) for mean monthly flows
Figure 12: Correlogram of deseasonalised monthly flow data (12 month lagged)
Figure 13: Mean Annual Temperatures for Central England for 1659 - 2011
Figure 14: Mean Annual Temperature of Central England (1701 - 1800)

Part One

This part aims to understand the probability distributions used to estimate the likelihood of random events. Precipitation is stochastic in nature, so rainfall events are clear examples of quantities that must be estimated from an established probability distribution.

The objective is to analyse rainfall frequencies from the Eden Catchment over a 40 year period, at several time scales, and to determine the general shape of each probability density curve. For each curve, the properties and the resulting estimated catchment parameters are presented.

1.1. Daily Rainfall

The plot of relative frequencies of daily rainfall depths (Figure 1) shows that the exponential distribution estimates the daily rainfall depth reasonably well. The fitted cumulative distribution function also follows the observed cumulative distribution.

The daily distribution has an approximate mean (μ) of 4.1 mm. Over the record length there were 8239 wet days, with rainfall depths between 0.1 mm and 79 mm. This wide range complicates the identification of a distribution fit for daily rainfall data, especially for rainfall depths close to zero. Initial comparisons in Minitab of the given candidate distributions (Normal, Exponential, 3-parameter Lognormal and Gamma) produced Table 2.

No candidate distribution fitted the data well at first (all p values < 0.005). However, narrowing the range of values by raising the calculation threshold to 0.4 mm reduced the Anderson-Darling (AD) statistic most markedly for the exponential distribution (from 238.2 to 58.7). Visual comparison supported this selection.
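As a minimal sketch of how percentile values for the fitted exponential distribution can be obtained, the inverse of the exponential CDF is x = -μ ln(1 - p). The function name below is illustrative, and the mean of 4.1 mm is the approximate value quoted above, so results may differ slightly from the Minitab output.

```python
import math


def exponential_percentile(p: float, mean: float) -> float:
    """Return x such that P(X <= x) = p for an exponential distribution."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must lie in [0, 1)")
    # Inverse CDF of the exponential distribution: x = -mean * ln(1 - p)
    return -mean * math.log(1.0 - p)


MEAN_DAILY_RAINFALL_MM = 4.1  # approximate mean from the text

for p in (0.10, 0.50, 0.90, 0.99):
    x = exponential_percentile(p, MEAN_DAILY_RAINFALL_MM)
    print(f"P(X <= x) = {p:.2f}: x = {x:.1f} mm")
```

For p = 0.10 this gives about 0.4 mm, which agrees with the 10th percentile reported in Table 1.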

Table 1: Estimated Percentile values for Exponential Distribution (mean = 4.1)

P(X ≤ x)                  x (mm)
0.10 (10th Percentile)    0.4

Table 2: Estimations of Distribution Fits for Daily Rainfall Data

                         No Threshold (All non-zero values)    Threshold (Values > 0.4mm)
Distribution             AD         P          LRT P           AD         P          LRT P
Normal                   675.409    < 0.005                    511.455    < 0.005
Exponential              238.213    < 0.003                    58.677     < 0.003
3-Parameter Lognormal    60.101     *          0.000           35.517     *          0.000
Gamma                    54.904     < 0.005                    67.735     < 0.005

Figure 1: Top - Histogram of Daily Rainfall; Bottom - Cumulative Distribution Plot

1.2. Monthly Rainfall

The monthly rainfall record for the Eden catchment has 469 observations ranging between 0.9 mm and 228.5 mm. The frequency distribution and cumulative distribution plot are shown in Figure 2. Selection of the best-fitting distribution followed the procedure used for daily rainfall above. Of the candidate distributions tested, the Gamma distribution fitted the monthly data best (AD = 0.472; p > 0.250). The shape and scale parameters of the Gamma distribution are approximately 2.8 and 25.5 respectively. Their product (shape × scale) gives a mean monthly rainfall for the record length of 71.2 mm, and the square root of (shape × scale²) gives an approximate standard deviation of 43 mm.
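The relation between the Gamma parameters and the moments can be checked with a short calculation. This is a sketch only: the variable names are illustrative, and because it uses the rounded parameters (2.8 and 25.5) the mean comes out slightly above the 71.2 mm reported from the unrounded fit.

```python
import math

shape = 2.8   # approximate Gamma shape parameter from the text
scale = 25.5  # approximate Gamma scale parameter from the text (mm)

# For a Gamma distribution: mean = shape * scale, variance = shape * scale^2
gamma_mean = shape * scale
gamma_sd = math.sqrt(shape * scale ** 2)

print(f"mean = {gamma_mean:.1f} mm, standard deviation = {gamma_sd:.1f} mm")
```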

Table 3: Estimated Percentile values for Gamma Distribution (shape = 2.8; scale = 25.5)

P(X ≤ x)                  x (mm)
0.10 (10th Percentile)    24.9
0.50 (50th Percentile)    62.9
0.90 (90th Percentile)    128.3
0.99 (99th Percentile)    205.2

1.3. Annual Rainfall

As expected, the annual rainfall approximately followed a normal distribution. The fitted curve has a mean (μ) of approximately 851 mm and a standard deviation (σ) of approximately 117 mm, based on 39 years of data. The best-fit selection followed the same procedure as above.

Table 4: Estimated Percentile values for Normal Distribution (mean = 851; standard deviation = 117)

P(X ≤ x)                  x (mm)
0.10 (10th Percentile)    701.2
0.50 (50th Percentile)    851.3
0.90 (90th Percentile)    1001.4
0.99 (99th Percentile)    1123.7
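The normal percentiles can be approximated with Python's standard library. This is a sketch that assumes the rounded parameters quoted in the text (mean 851 mm, standard deviation 117 mm), so the values agree with Table 4 only to within about 1 mm.

```python
from statistics import NormalDist

# Rounded parameters from the text; the fitted values are slightly different.
annual_rainfall = NormalDist(mu=851, sigma=117)


def annual_percentile(p: float) -> float:
    """Return the annual rainfall depth x (mm) such that P(X <= x) = p."""
    return annual_rainfall.inv_cdf(p)


for p in (0.10, 0.50, 0.90, 0.99):
    print(f"P(X <= x) = {p:.2f}: x = {annual_percentile(p):.1f} mm")
```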

Figure 2: Top - Histogram of Monthly Rainfall; Bottom - Cumulative Distribution Plot of Monthly Rainfall

Figure 3: Top - Histogram of Annual Rainfall; Bottom - Cumulative Distribution Plot of Annual Rainfall
[Top panel: Histogram of Annual Rainfall (mm) with Normal Distribution Fit; x-axis: Annual Rainfall (mm); y-axis: Frequency]

Part Two

Part One highlighted the variation of rainfall at different temporal scales. This part aims to define relationships between Standard Annual Average Rainfall (SAAR) and geospatial variables in the Eden Catchment. Regression analysis is used to test the dependence, and the resulting model is then used to predict SAAR at a specific location within stated margins of uncertainty.

The predictor variables given for the analysis are Elevation (Elev), Easting (E) and Northing (N). A first look at the catchment's Digital Elevation and Interpolated Annual Rainfall maps (see Appendix) suggests some correlation between elevation and rainfall, especially in the lower-lying areas of the catchment. Spatial variation is also evident in the steady decrease in rainfall northwards, but the east-west variation is harder to judge visually.

Initial regression of SAAR with Elevation produced the following model:

Equation 1: Regression Model of SAAR with Elevation

SAAR (mm) = 523.8 + 2.565 Elevation (m)

The model results presented in Table 5 and Table 6 show reasonable prediction of SAAR (R2 = 0.716; p < 0.005), with standard errors that are small relative to the coefficients.

Table 5: Model Summary of SAAR vs Elevation Regression

S (mm)    R2        R2 (adjusted)    PRESS      R2 (predictive)
194.2     71.60%    70.46%           1143016    65.55%

Table 6: Coefficients of SAAR vs Elevation Regression

Term             Coef     SE Coef    95% CI            T-Value    P-Value
Constant         523.8    90.7       (337.1, 710.5)    5.78       0.000
Elevation (m)    2.565    0.323      (1.900, 3.231)    7.94       0.000
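The adjusted R2 in Table 5 penalises R2 for the number of predictors. The sketch below shows the standard formula. The sample size n = 27 is an assumption: it is not stated in the text, but it is the value that reproduces the reported adjusted R2 of 70.46% from R2 = 71.60% with one predictor.

```python
def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


N_GAUGES = 27  # assumed sample size, inferred rather than stated in the text

print(f"Elevation only: {100 * adjusted_r2(0.7160, N_GAUGES, 1):.2f}%")
```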

Although the histogram of the residuals in Figure 5 does not look normally distributed on visual inspection, the Anderson-Darling test gives no evidence against normality at the 95% confidence level (mean of residuals = -0.01; AD = 0.483; p = 0.212). The small sample size may explain this apparent contradiction, as normality tests on small samples rarely reject normality (Machiwel & Jha, 2012). Nevertheless, the normal probability plot shows points clustered about the normal line. The assumption that the residuals follow a normal distribution is therefore taken as satisfied.

When plotted in order of observation, the residuals scatter randomly about the centre line with no clustering. This satisfies the assumption that the residuals are not correlated with one another.

The residuals plotted against fitted values show variance increasing from left to right. This is evidence of non-constant variance and violates the assumption of homoscedasticity.

Figure 4: Scatterplot of Regression relation (SAAR versus Elevation)
[x-axis: Elevation (m); y-axis: Standard Annual Average Rainfall (mm)]

This violation affects the validity of the model, so the model may require refinement, either by transformation of the response variable or by inclusion of other predictor variables.

The variable combinations were then compared in Minitab to select the optimal set of terms (high R2, significant p values, low errors and few variables). This produced the following summary table:

Table 7: Best combinations of SAAR regression variables

No of Variables    R2      R2 (adjusted)    R2 (predictive)    Mallows Cp    S (mm)    Variable Combination (p-value)
1                  71.6    70.5             65.6               10.6          194.16    Elevation (p=0.000)
1                  54.7    52.9             46.3               30.5          245.08    Easting or Northing (p=0.000)
2                  77.6    75.8             67.8               5.4           175.79    Elevation (p=0.000), Northing (p=0.018)
2                  72.6    70.3             65.5               11.4          194.71    Elevation (p=0.000), Easting (p=0.363)
All                80.5    78.0             66.4               4.0           167.56    Elevation (p=0.000), Easting (p=0.077), Northing (p=0.005)

The results (Table 7) show that elevation (through orographic uplift or cloud seeding) is the major physical determinant of rainfall for this catchment, as it is the single variable that explains the most variation in SAAR. Identifying the predominant physical process statistically assists the interpretation of the other physical effects that determine wet and dry areas within the catchment.

The results also show that the model with all three variables fits the responses reasonably well. The added terms improve the model's ability to fit responses to changes in the variables (adjusted R2 = 0.78). Comparing the coefficients, Elevation remains highly significant to the overall annual rainfall model (p = 0.000). Northing is also significant (p = 0.005), while Easting is not significant at the 5% level (p = 0.077).

This may suggest that the Easting variable adds little to the model, and that the model would reproduce similar responses in SAAR without it. Indeed, the model that predicts SAAR from Elevation and Northing alone has a higher predictive R2 value (67.8% against 66.4%). Nevertheless, the three-variable model has the lower Mallows Cp (4.0 against 5.4) and the lower standard error of the regression, S (167.56 mm against 175.79 mm), so a trade-off is made in its favour. The regression of SAAR on Elevation, Easting and Northing is therefore considered practical enough to be used for subsequent predictions.

Figure 5: Residual Plots of SAAR Regression with Elevation

The revised regression produced the following model, summarized in Table 8 and Table 9:

Equation 2: Revised Regression Model of SAAR

SAAR (mm) = 12633 + 2.142 Elevation (m) - 0.009 Easting - 0.017 Northing

Table 8: Model Summary of revised SAAR regression

S (mm)     R2        R2 (adjusted)    PRESS      R2 (predictive)
167.559    80.54%    78.00%           1115354    66.39%

Table 9: Coefficients of revised SAAR regression model

Term             Coef        SE Coef    95% CI                  T-Value    P-Value    VIF
Constant         12633       3758       (4859, 20407)           3.36       0.003
Elevation (m)    2.142       0.388      (1.339, 2.945)          5.52       0.000      1.94
Easting          -0.00900    0.00487    (-0.01907, 0.00107)     -1.85      0.077      1.52
Northing         -0.01684    0.00549    (-0.02820, -0.00548)    -3.07      0.005      1.88

Details of the model in Table 9 show that average annual rainfall within the catchment increases with elevation (positive coefficient) but decreases northwards and eastwards (negative coefficients). Prior understanding of the predominant effect of orographic uplift (or cloud seeding) within the catchment, together with visual inspection of the catchment area maps, assists in detecting physical patterns. The catchment maps show highland areas on the southern and eastern boundaries, with lower-lying areas towards the north. Comparing the average rainfall map with the DEM map (see Appendix), the low-lying northern reaches of the catchment receive less rainfall. However, even the highlands on the eastern boundary receive significantly less rainfall. This corresponds with the model predictions and can be interpreted as a rain shadow effect, caused by the highlands in the south-west sheltering the east from the predominant rain-laden south-westerly winds (Pollock, et al., 2013). The given 2005 rainfall map, which shows intense rainfall in the eastern highlands, departs from this pattern; this may be due to enhanced cloud seeding during a convective storm.

Figure 6: Residual plots of refined SAAR regression model

Figure 6 shows residual plots that test the validity of the refined linear model. The residuals show no evidence against normality (mean = -0.0000; AD = 0.283; p = 0.607) and, as in the first model, show a random pattern when plotted against record order. This random pattern gives evidence that the errors are not correlated with one another. Table 9 also shows little multicollinearity among the predictors (all VIF values below 2).

However, the residuals in this revised model still show evidence of non-constant variance. This may be due to variables missing from the model. From previous analysis, direction of slope (aspect) combined with elevation may give better predictions of SAAR in the catchment.

When applying this model to an ungauged site, its shortfalls must therefore be taken into consideration, as predictions are accurate only if the model represents the true relationship. Given such a site with predictor variables Elevation = 400 m, Easting = 380000 and Northing = 500000, SAAR can be estimated as follows:

Table 10: Model application results from ungauged site

Estimated SAAR    SE Fit    95% CI              95% PI
1648.3 mm         65.1      (1513.7, 1782.8)    (1276.4, 2020.1)
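The point estimate can be approximated by substituting the site values into the regression model. This is a sketch using the rounded coefficients printed in Table 9, with an illustrative function name. Because Minitab works with unrounded coefficients, the result (about 1650 mm) differs slightly from the 1648.3 mm in Table 10.

```python
# Rounded coefficients from Table 9 (Minitab holds more decimal places).
INTERCEPT = 12633.0
COEF_ELEVATION = 2.142    # mm of SAAR per m of elevation
COEF_EASTING = -0.00900   # mm of SAAR per unit of easting
COEF_NORTHING = -0.01684  # mm of SAAR per unit of northing


def predict_saar(elevation_m: float, easting: float, northing: float) -> float:
    """Evaluate the revised linear regression model for SAAR (mm)."""
    return (INTERCEPT
            + COEF_ELEVATION * elevation_m
            + COEF_EASTING * easting
            + COEF_NORTHING * northing)


saar = predict_saar(elevation_m=400, easting=380000, northing=500000)
print(f"Estimated SAAR = {saar:.1f} mm")
```

Note that the Northing coefficient must be taken to at least four significant figures: using the rounded -0.017 shown in Equation 2 shifts the estimate by about 80 mm, because it is multiplied by a very large coordinate value.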

Table 10 gives a predicted SAAR of 1648.3 mm for the given set of predictor values. The 95% confidence interval indicates, with 95% confidence, that the true mean (expected value) of SAAR for sites with these predictor values lies between 1513.7 mm and 1782.8 mm. The prediction interval, on the other hand, gives the range that is likely to contain a single new observation of SAAR at such a site 95% of the time. This interval is wider because it must account for the scatter of individual values about the mean as well as the uncertainty in the estimated mean itself. Therefore, even if the model correctly represents the expected response for a given set of predictor values, its estimate of any particular response for the same set of values is at best a crude one.
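The difference between the two intervals can be illustrated numerically. The sketch below rebuilds both from the fitted value, SE Fit and S reported in the tables. The critical value t = 2.069 is an assumption: it corresponds to 23 degrees of freedom (an assumed sample of 27 sites less 4 coefficients), which the text does not state.

```python
import math

fit = 1648.3    # estimated SAAR (mm), Table 10
se_fit = 65.1   # standard error of the fitted mean, Table 10
s = 167.559     # standard error of the regression (mm), Table 8
t_crit = 2.069  # assumed two-sided 95% t value for 23 degrees of freedom


def interval(centre: float, standard_error: float, t: float) -> tuple:
    """Return (lower, upper) limits of centre +/- t * standard_error."""
    half_width = t * standard_error
    return centre - half_width, centre + half_width


# Confidence interval: uncertainty in the mean response only.
ci = interval(fit, se_fit, t_crit)

# Prediction interval: adds the scatter of single observations about the mean.
se_prediction = math.sqrt(s ** 2 + se_fit ** 2)
pi = interval(fit, se_prediction, t_crit)

print(f"95% CI: ({ci[0]:.1f}, {ci[1]:.1f})")
print(f"95% PI: ({pi[0]:.1f}, {pi[1]:.1f})")
```

Under these assumptions both intervals agree with Table 10 to within rounding, and the prediction interval is wider because S is much larger than SE Fit.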


Part Three

Part Three focuses on the temporal relationship of events with themselves and with one another. The temporal focus aids understanding of specific processes by investigating their behaviour through time. This understanding is crucial in decision making, optimized engineering design and accurate prediction, because future events depend on past and present events.

Statistical tools assist the detection of time-based patterns in data and provide methods of analysis. One such tool is autocorrelation, which investigates the inherent memory or influence of a process on itself (Machiwel & Jha, 2012). Figure 7 shows a correlogram of daily rainfall data from the Eden catchment for lags of up to 30 days, chosen to show variation within a month. The correlogram shows strong autocorrelation at short lags, which remains significant for a few days. This correlation is observed physically as the tendency for events to persist. Figure 8 shows clear evidence of this persistence: the red ovals highlight dry days following dry days and wet days following wet days.

Monthly rainfall also shows strong autocorrelation with the immediately succeeding month (Figure 9). However, this effect wanes significantly after one month. This persistence, or influence of antecedent conditions, is not evident in the autocorrelation of annual rainfall at lags of up to 10 years (Figure 10). This is usually due to changes in the environment and the dissipation of physical inertia over time. Rainfall is generally a phenomenon that responds quickly to alterations in atmospheric conditions. It is therefore likely to exert influence over subsequent events in its time series only for a relatively short period, as antecedent conditions vary rapidly.

Figure 7: Correlogram of Daily Rainfall in the Eden Catchment (30-day lag)
[Panel title: Autocorrelation Function for Daily Rainfall (with 5% significance limits for the autocorrelations); x-axis: Lag; y-axis: Autocorrelation]

Figure 8: Sample time series plot of Daily Rainfall Data showing persistence (autocorrelation) in rainfall data.
[x-axis: Time Step Index (Days); y-axis: Rainfall Depth (mm)]

Figure 9: Autocorrelation (12 month lag) plot for Monthly Rainfall in the Eden Catchment
[Panel title: Autocorrelation Function for Monthly Rainfall (with 5% significance limits for the autocorrelations); x-axis: Lag; y-axis: Autocorrelation]
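The correlograms discussed above plot the sample autocorrelation coefficient against lag. A minimal sketch of that calculation follows. The rainfall values are hypothetical, and the significance limits use the simple large-sample approximation of ±1.96/√n, which is an assumption here; statistical packages may compute their limits differently.

```python
import math


def autocorrelation(series, lag):
    """Sample autocorrelation coefficient of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    # The denominator uses the full series; the numerator uses overlapping pairs.
    denominator = sum((x - mean) ** 2 for x in series)
    numerator = sum((series[t] - mean) * (series[t + lag] - mean)
                    for t in range(n - lag))
    return numerator / denominator


# Hypothetical daily rainfall depths (mm), for illustration only.
rainfall = [0.0, 0.0, 5.2, 8.1, 3.4, 0.0, 0.0, 0.0, 12.6, 9.8, 4.1, 0.2, 0.0, 0.0]

limit = 1.96 / math.sqrt(len(rainfall))  # approximate 5% significance limit
for lag in range(1, 6):
    r = autocorrelation(rainfall, lag)
    flag = "significant" if abs(r) > limit else "not significant"
    print(f"lag {lag}: r = {r:+.3f} ({flag})")
```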

This dependence on antecedent conditions is also demonstrated in the correlogram for the stream flow time series (Figure 11). High flows tend to follow high flows, and low flows are more likely to follow low flows.

Figure 10: Autocorrelation (10 year lagged) of Annual rainfall in the Eden Catchment.
[Panel title: Autocorrelation Function for Annual Rainfall (with 5% significance limits for the autocorrelations); x-axis: Lag; y-axis: Autocorrelation]

Figure 11: Autocorrelation (12 month lagged) for mean monthly flows at Eden Sheepmouth (1970 - 2000)
[Panel title: Autocorrelation Function for Monthly Flow (with 5% significance limits for the autocorrelations); x-axis: Lag Time (12 months); y-axis: Autocorrelation]

Because time series are usually a combination of several complex and intricately correlated components, one component can mask the detection of another. This masking prevents proper understanding of the masked component, which may be crucial to overall insight into the behaviour of the time series. A clear example is the effect of the seasonality component on the trend component.

Stream flows are known to follow seasonal patterns of high and low flows. However, other factors that are not seasonal, such as land use variations, may also affect stream flow. It is therefore necessary to strip the stream flow series of its seasonality component to determine the significance of stream flow variation caused by other factors.

This process of stripping is called deseasonalisation. To achieve this, the difference between each observed value and the mean for its calendar month is standardized using the standard deviation for that calendar month. This removes the seasonal cycle so that the remaining month-to-month variation can be examined on its own. The formula used for deseasonalising the data is:

$$z_{i,m} = \frac{x_{i,m} - \bar{x}_{m}}{s_{m}}$$

where: $x_{i,m}$ = observed flow for month; $\bar{x}_{m}$ = mean for calendar month; $s_{m}$ = standard deviation for calendar month; $m$ = calendar month in question

The resulting correlogram in Figure 12 shows persistence extending only to adjacent months.

Figure 12: Correlogram of deseasonalised monthly flow data (12 month lagged)
[Panel title: Autocorrelation Function for Deseasonalised Monthly Flows (with 5% significance limits for the autocorrelations); x-axis: Lag; y-axis: Autocorrelation]
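The deseasonalisation step described above can be sketched as follows. The function name and the flow values are hypothetical. The sketch assumes that the series starts in the first calendar month, that every calendar month has at least two observations, and that the sample standard deviation is used; the text does not say which form of standard deviation was applied.

```python
from statistics import mean, stdev


def deseasonalise(flows, period=12):
    """Standardise each value by the mean and standard deviation
    of its calendar month (position within the seasonal period)."""
    # Group the observations by calendar month: flows[m], flows[m + period], ...
    by_month = [flows[m::period] for m in range(period)]
    month_means = [mean(values) for values in by_month]
    month_sds = [stdev(values) for values in by_month]
    return [(x - month_means[i % period]) / month_sds[i % period]
            for i, x in enumerate(flows)]


# Hypothetical example with a two-step "season" to keep the numbers small.
flows = [10.0, 100.0, 14.0, 120.0, 12.0, 110.0]
print(deseasonalise(flows, period=2))
```

In this small example the large difference between the two seasonal positions disappears, and what remains is how far each value sits from the typical value for its own position in the cycle.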

Once seasonality is understood and addressed, it is possible to examine trends. Trend analysis is central to forecasting and projection, and therefore essential to decision-making processes that rely on forecasts and projections. Because forecasting and projection models are good only if they represent the true behaviour of the system, the (partial duration) time series used to detect a general trend must be representative of the entire system. This property is called ergodicity. The difficulty of obtaining a representative time series is primarily due to limits on record length. Behaviour that precedes the first available records can only be crudely guessed, while behaviour (trends) that follows the end of the record can be predicted only within reasonable uncertainty limits.

Figure 13: Mean Annual Temperatures for Central England for 1659 - 2011

Figure 14: Mean Annual Temperature of Central England (1701 - 1800)

The importance of record length to developing decision support systems is illustrated clearly in Figure 13, which shows mean annual temperature in Central England from 1659 to 2011. This temperature series has been split into three century-long partial duration series (Figure 14, Figure 15 and Figure 16).

Each partial series exhibits its own trend, applicable only within its record length, and does not conform to the overall trend of the entire series. This highlights the danger of extrapolating outside the range of predictor values. It is therefore imperative to understand the uncertainty associated with the data record period available and to calibrate decision support models to reflect such unknowns.

The general upward trend of mean annual temperatures shows that the mean is non-homogeneous. This may be due to changes in the method of data collection, in the environment, or both (Machiwel & Jha, 2012). Variations in the environment due to climate change are a possible cause of this non-homogeneity.

Table 11: Trend model details and uncertainty parameters

Record Length      Trend Equation           Mean Absolute Percentage Error    Mean Absolute Deviation    Mean Squared Deviation
1701 - 1800        Y(t) = 9.31 - 0.003t     4.9%                              0.44                       0.34
1801 - 1900        Y(t) = 9.10 + 0.00036t   5.5%                              0.49                       0.38
1901 - 2000        Y(t) = 9.16 + 0.007t     4.1%                              0.39                       0.24
Complete Series    Y(t) = 8.76 + 0.003t     5.3%                              0.48                       0.37

It must be emphasized, nonetheless, that the errors shown in Table 11 give error margins only for the record length supplied to Minitab. When extrapolating from one time window to the next, additional error terms must therefore be included, which increases the uncertainty.
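A linear trend model of the form Y(t) = a + bt, with the three accuracy measures listed in Table 11, can be sketched as below. The function names and temperature values are hypothetical. The sketch assumes an ordinary least-squares fit with the time index running from 1 to n, and the usual definitions of the mean absolute percentage error (MAPE), mean absolute deviation (MAD) and mean squared deviation (MSD).

```python
def fit_linear_trend(y):
    """Least-squares fit of y = intercept + slope * t for t = 1..n."""
    n = len(y)
    t_values = range(1, n + 1)
    t_mean = (n + 1) / 2
    y_mean = sum(y) / n
    slope = (sum((t - t_mean) * (yi - y_mean) for t, yi in zip(t_values, y))
             / sum((t - t_mean) ** 2 for t in t_values))
    intercept = y_mean - slope * t_mean
    return intercept, slope


def accuracy_measures(y, intercept, slope):
    """Return (MAPE in percent, MAD, MSD) of the fitted trend."""
    fits = [intercept + slope * t for t in range(1, len(y) + 1)]
    errors = [yi - fi for yi, fi in zip(y, fits)]
    n = len(y)
    mape = 100 * sum(abs(e / yi) for e, yi in zip(errors, y)) / n
    mad = sum(abs(e) for e in errors) / n
    msd = sum(e * e for e in errors) / n
    return mape, mad, msd


# Hypothetical mean annual temperatures (deg C), for illustration only.
temperatures = [9.1, 8.7, 9.6, 9.0, 9.8, 9.3, 10.1, 9.5]
a, b = fit_linear_trend(temperatures)
mape, mad, msd = accuracy_measures(temperatures, a, b)
print(f"Y(t) = {a:.2f} + {b:.3f}t; MAPE = {mape:.1f}%, MAD = {mad:.2f}, MSD = {msd:.2f}")
```

These measures describe only how well the line fits the window it was fitted to, which is why they cannot be carried over unchanged to another time window.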

References

Machiwel, D. & Jha, M., 2012. Hydrologic Time Series Analysis. New Delhi: Capital Publishing Company.

Pollock, M. et al., 2013. World Meteorological Organisation. [Online] Available at: http://www.wmo.int/pages/prog/www/IMOP/publications/IOM-116_TECO-2014/Session%203/O3_9_Pollock_Accurate_Rainfall_measurement.pdf [Accessed 9 December 2014].


