DmExercise6 2014


Data mining: exercise 6 13.3.2014

1. Download the file from the ‘vertigo’ link on the course home page http://www.uta.fi/sis/tie/tl. The file contains data on six otoneurological diseases, with the diagnosis number in the first column: vestibular schwannoma VS, benign positional vertigo BPV, Meniere’s disease MD, sudden deafness SD, traumatic vertigo TV and vestibular neuritis VN. These are encoded, in that order, as 1 - 6. In addition to the diagnosis number, the file contains the values of the five most important variables. They are in columns 2-6, in the order shown in the following table. All variables are categorical: the first four are ordinal and the last is binary.

variable                                  categories

1st: duration of vertigo                  no symptoms, a few days, 1-4 weeks, 1-4 months, <1 year, 1-4 years, >4 years

2nd: frequency of vertigo attacks         no spells, only once, 1-2 annually, 3-12 annually, 1-4 monthly, 2-7 weekly, several times a day, constant dizziness

3rd: duration of vertigo attacks          no attacks, 1-15 s, 15 s – 5 min, 5 min – 4 h, 4 – 24 h, 1-5 days

4th: duration of hearing loss             no hearing loss, a few days, 1-4 weeks, 1-4 months, <1 year, 1-4 years, >4 years

5th: occurrence of head injury in         no, yes
     relation to the onset of symptoms

Data values are encoded from 0 up to the maximum of each variable (the number of categories minus 1), in the order presented in the table. Prepare histograms in Excel of all variables for disease VS to show the value distribution of each variable. Repeat the same task for disease MD. Do the distributions of these two diseases differ? Which variables could be useful for separating patient cases of these two diseases?
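The exercise asks for Excel histograms; the same value counts can also be sketched in Python. The rows below are invented purely for illustration (they are not taken from the vertigo file), and the names `rows`, `N_CATEGORIES`, `value_counts` and `print_histogram` are hypothetical.

```python
from collections import Counter

# Hypothetical rows (NOT from the vertigo file): diagnosis code followed by
# the values of variables 1-5, in the column order described in the exercise.
rows = [
    (1, 6, 0, 0, 5, 0),
    (1, 5, 0, 0, 6, 0),
    (3, 4, 3, 3, 4, 0),
    (3, 4, 4, 3, 5, 0),
    (3, 5, 3, 2, 4, 0),
]

# Number of categories of variables 1-5, read off the table.
N_CATEGORIES = [7, 8, 6, 7, 2]


def value_counts(rows, diagnosis, variable):
    """Count each code 0..max of one variable (1-5) among cases of one disease."""
    counts = Counter(r[variable] for r in rows if r[0] == diagnosis)
    return [counts.get(code, 0) for code in range(N_CATEGORIES[variable - 1])]


def print_histogram(rows, diagnosis, variable):
    """Print a simple text histogram, one line per category code."""
    for code, n in enumerate(value_counts(rows, diagnosis, variable)):
        print(f"{code}: {'#' * n}")


# Compare variable 1 between VS (code 1) and MD (code 3).
print_histogram(rows, 1, 1)
print_histogram(rows, 3, 1)
```

Counting over the full code range, including categories with zero cases, keeps the histograms of two diseases directly comparable.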

2. Compute the mode of each variable for every disease in the vertigo data. If the modes differ between diseases, the variable is promising for separating the disease classes.
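As a rough illustration of the mode computation, the sketch below uses `statistics.multimode` from the Python standard library, which returns all tied modes. The rows are again invented for illustration and `modes_by_disease` is a hypothetical helper name.

```python
from statistics import multimode

# Hypothetical rows (NOT from the vertigo file): diagnosis code, then variables 1-5.
rows = [
    (1, 6, 0, 0, 5, 0),
    (1, 5, 0, 0, 6, 0),
    (3, 4, 3, 3, 4, 0),
    (3, 4, 4, 3, 5, 0),
    (3, 5, 3, 2, 4, 0),
]


def modes_by_disease(rows):
    """Map each diagnosis code to a list of mode lists, one per variable 1-5."""
    result = {}
    for diagnosis in sorted({r[0] for r in rows}):
        cases = [r for r in rows if r[0] == diagnosis]
        # A variable can have several modes when counts tie, so keep them all.
        result[diagnosis] = [
            sorted(multimode(r[v] for r in cases)) for v in range(1, 6)
        ]
    return result


for diagnosis, modes in modes_by_disease(rows).items():
    print(diagnosis, modes)
```

Keeping ties visible matters here: a variable whose mode is ambiguous within one disease is a weaker separator than one with a single clear mode.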

3. Open the file under the ‘population’ link on the course home page http://www.uta.fi/sis/tie/tl. Take the values of the variables (columns) ‘born’ (syntyneet), ‘dead’ (kuolleet), ‘fertility index’ (kokonaishedelmällisyysluku), ‘infant death rate’ (imeväiskuolleisuusluku) and ‘population’ (väkiluku). Compute the correlation coefficient between ‘born’ and ‘fertility index’, and then between ‘born’ and ‘dead’. (Excel functions can be used.) What can you conclude from the correlation values obtained?
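For reference, the Pearson correlation coefficient (the quantity Excel's CORREL returns) can be sketched in a few lines of Python. The function name `pearson` and the sample lists are hypothetical and not taken from the population file.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


# Made-up numbers for illustration only.
born = [10.0, 12.0, 11.0, 9.0]
fertility = [2.0, 2.4, 2.1, 1.8]
print(pearson(born, fertility))
```

A value near +1 or -1 indicates a strong linear relation between the two columns; with time series, a shared trend over the years can produce such values even without a direct causal link.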

4. Next, compute the correlation between ‘population’ and ‘infant death rate’ and, further, between ‘infant death rate’ and ‘dead’. What can you say about the relations between these time series?

5. Make “population predictions” from the whole data column of the variable ‘population’ (väkiluku) using the Excel function ‘forecast’, which computes a linear trend. The x variable of ‘forecast’ is ‘year’ (vuosi). Make the same kind of predictions for ‘fertility index’ (kokonaishedelmällisyysluku). Compute the predictions for the years 2010, 2020, …, 2060 and 2100. Are the results realistic? Compare the population results with those given by Statistics Finland (Age groups total):

http://tilastokeskus.fi/til/vaenn/2009/vaenn_2009_2009-09-30_tau_001_en.html
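Excel's FORECAST(x, known_y's, known_x's) fits a least-squares line through the known points and evaluates it at x. A minimal Python sketch of the same calculation follows; the function name `forecast` mirrors the Excel argument order, and the years and values are invented for illustration only.

```python
def forecast(x, known_ys, known_xs):
    """Evaluate the least-squares line through (known_xs, known_ys) at x."""
    n = len(known_xs)
    if n != len(known_ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(known_xs) / n
    mean_y = sum(known_ys) / n
    sxx = sum((xi - mean_x) ** 2 for xi in known_xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    slope = sum(
        (xi - mean_x) * (yi - mean_y) for xi, yi in zip(known_xs, known_ys)
    ) / sxx
    # The fitted line always passes through the point of means.
    return mean_y + slope * (x - mean_x)


# Made-up series, NOT the Finnish population data.
years = [2000, 2001, 2002, 2003, 2004]
values = [5.0, 5.1, 5.1, 5.2, 5.3]
for target in (2010, 2020, 2060, 2100):
    print(target, round(forecast(target, values, years), 3))
```

Because the model is a straight line, predictions far outside the observed years simply continue the fitted slope, which is worth keeping in mind when judging whether the 2100 figures are realistic.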

6. Continue by computing ‘forecast’ predictions for the above-mentioned years for ‘born’ and ‘dead’. What light do they shed on the future?

7. Finally, predict the population for 2005-2009 with ‘forecast’, first using the whole population data column up to 2004 and then using only the years 1951-2004. Compare the results with the actual values given in the population data file.
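The comparison of training windows against held-out years can be sketched as follows. This is only an illustration under stated assumptions: `fit_line` and `holdout_errors` are hypothetical helper names, and the series is synthetic rather than the population data.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def holdout_errors(series, train_start, train_end, test_years):
    """Fit on years train_start..train_end and return prediction minus actual."""
    train = [(y, v) for y, v in sorted(series.items()) if train_start <= y <= train_end]
    slope, intercept = fit_line([y for y, _ in train], [v for _, v in train])
    return {year: slope * year + intercept - series[year] for year in test_years}


# Synthetic, exactly linear series (NOT real data): year -> value.
series = {year: 50.0 + 0.5 * year for year in range(1990, 2010)}

# Two training windows, both ending in 2004, tested on 2005-2009.
print(holdout_errors(series, 1990, 2004, range(2005, 2010)))
print(holdout_errors(series, 2000, 2004, range(2005, 2010)))
```

On real data the two windows will generally give different errors, since a longer window averages over older trends while a shorter one follows recent years more closely.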