DmExercise6 2014
Data mining: exercise 6 13.3.2014
1. Copy the file from the ‘vertigo’ link on the course home page
http://www.uta.fi/sis/tie/tl. The file contains data on six otoneurological
diseases, identified by the diagnosis number in the first column: vestibular
schwannoma VS, benign positional vertigo BPV, Meniere’s disease MD, sudden
deafness SD, traumatic vertigo TV and vestibular neuritis VN. These are encoded, in
this order, as 1-6. In addition to the diagnosis number, the file holds the values
of the five most important variables in columns 2-6, in the order shown in the
following table. All variables are categorical: the first four are ordinal and the
last is binary.
variable                                      categories
1st: duration of vertigo                      no symptoms, a few days, 1-4 weeks, 1-4 months, <1 year, 1-4 years, >4 years
2nd: frequency of vertigo attacks             no spells, only once, 1-2 annually, 3-12 annually, 1-4 monthly, 2-7 weekly, several times a day, constant dizziness
3rd: duration of vertigo attacks              no attacks, 1-15 s, 15 s – 5 min, 5 min – 4 h, 4–24 h, 1-5 days
4th: duration of hearing loss                 no hearing loss, a few days, 1-4 weeks, 1-4 months, <1 year, 1-4 years, >4 years
5th: occurrence of head injury in relation    no, yes
     to the onset of symptoms
Data values are encoded from 0 to the maximum of each variable (the number of
categories minus 1) in the order presented in the table. Prepare
histograms in Excel of every variable for disease VS to show the value
distribution of each variable. Repeat the task for disease MD. Do the
distributions differ between these two diseases? Which variables could be useful for
separating patient cases of these two diseases?
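The counting behind such histograms can also be sketched outside Excel. The Python example below is only an illustration: the rows are hypothetical toy values, not the real vertigo data, and the names ROWS, MAX_CODE and histogram are made up for this sketch. It assumes each row is laid out as described above (diagnosis number first, then the five variables).

```python
from collections import Counter

# Hypothetical toy rows, NOT the real vertigo data:
# [diagnosis, var1, var2, var3, var4, var5]
ROWS = [
    [1, 6, 0, 0, 6, 0],
    [1, 5, 2, 1, 6, 0],
    [3, 4, 3, 3, 5, 0],
    [3, 6, 4, 3, 6, 0],
    [3, 5, 4, 4, 5, 0],
]

# Largest code of each variable: number of categories in the table minus 1.
MAX_CODE = [6, 7, 5, 6, 1]


def histogram(rows, diagnosis, var_number):
    """Count how often each code 0..max occurs for one variable of one disease.

    var_number is 1-based (1 = duration of vertigo, ..., 5 = head injury).
    """
    counts = Counter(row[var_number] for row in rows if row[0] == diagnosis)
    return [counts.get(code, 0) for code in range(MAX_CODE[var_number - 1] + 1)]


vs_hist = histogram(ROWS, 1, 1)  # disease VS (code 1), 1st variable
md_hist = histogram(ROWS, 3, 1)  # disease MD (code 3), 1st variable
print("VS:", vs_hist)
print("MD:", md_hist)
```

Listing every code from 0 to the maximum, including codes with zero cases, keeps the histograms of two diseases directly comparable bar by bar.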
2. Compute the mode of each variable for every disease in the vertigo data. Where the
modes differ between diseases, the variable is promising for separating the disease classes.
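As a rough sketch of the mode computation, the Python example below groups cases by diagnosis number and takes the mode of each column. The rows are hypothetical toy values, not the real vertigo data, and the function name modes_by_disease is invented for this illustration. Ties are reported as several modes, which is one possible convention; Excel's MODE function returns only a single value.

```python
from statistics import multimode

# Hypothetical toy rows, NOT the real vertigo data:
# [diagnosis, var1, var2, var3, var4, var5]
ROWS = [
    [1, 6, 0, 0, 6, 0],
    [1, 6, 2, 1, 6, 0],
    [1, 5, 2, 1, 4, 0],
    [3, 4, 3, 3, 5, 0],
    [3, 6, 4, 3, 6, 0],
    [3, 5, 4, 4, 5, 0],
]


def modes_by_disease(rows):
    """Return {diagnosis: [modes of var1, ..., modes of var5]}.

    Each entry is a sorted list, so tied values are all reported.
    """
    cases = {}
    for diagnosis, *values in rows:
        cases.setdefault(diagnosis, []).append(values)
    return {
        diagnosis: [sorted(multimode(column)) for column in zip(*rows_of_disease)]
        for diagnosis, rows_of_disease in cases.items()
    }


modes = modes_by_disease(ROWS)
for diagnosis, variable_modes in sorted(modes.items()):
    print(diagnosis, variable_modes)
```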
3. Open the file under the ‘population’ link of the course home page
http://www.uta.fi/sis/tie/tl. Take the values of the variables (columns) ‘born’
(syntyneet), ‘dead’ (kuolleet), ‘fertility index’ (kokonaishedelmällisyysluku),
‘infant death rate’ (imeväiskuolleisuusluku) and ‘population’ (väkiluku). Compute
the correlation coefficient between ‘born’ and ‘fertility index’, and then
between ‘born’ and ‘dead’. (Excel functions can be used.)
What can you conclude from the correlation values obtained?
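For reference, the quantity asked for here is assumed to be the Pearson correlation coefficient, which is what Excel's CORREL function computes. A minimal Python sketch of it follows; the function name pearson and the two short series are hypothetical and only illustrate the formula, they are not taken from the population file.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / sqrt(sxx * syy)


# Hypothetical toy series, NOT values from the population file.
rising = [1.0, 2.0, 3.0, 4.0]
doubled = [2.0, 4.0, 6.0, 8.0]
falling = [4.0, 3.0, 2.0, 1.0]
r_same_direction = pearson(rising, doubled)
r_opposite_direction = pearson(rising, falling)
print(r_same_direction, r_opposite_direction)
```

A value near +1 or -1 only indicates a strong linear relation between the two columns; it says nothing about which one, if either, drives the other.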
4. Also compute the correlation between ‘population’ and ‘infant death rate’ and
between ‘infant death rate’ and ‘dead’. What can you say about the
relations between these time series?
5. Make “population predictions” from the whole data column of the variable
‘population’ (väkiluku) using the Excel function ‘forecast’, which computes a
linear trend. The x variable of forecast is ‘year’ (vuosi). Make the same kind of
prediction for ‘fertility index’ (kokonaishedelmällisyysluku). Compute the predictions for the
years 2010, 2020, …, 2060 and 2100. Are the results realistic? Compare the
population results with those of Statistics Finland (Age groups total):
http://tilastokeskus.fi/til/vaenn/2009/vaenn_2009_2009-09-30_tau_001_en.html
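To make clear what a linear-trend forecast does, the Python sketch below fits a least-squares line through (year, value) pairs and evaluates it at a target year, with the arguments in the same order as Excel's FORECAST(x, known_y's, known_x's). The years and values are hypothetical toy numbers, not the population file; only the arithmetic is the point.

```python
def forecast(x, known_ys, known_xs):
    """Evaluate the least-squares line through (known_xs, known_ys) at x."""
    n = len(known_xs)
    if n != len(known_ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(known_xs) / n
    mean_y = sum(known_ys) / n
    sxx = sum((xi - mean_x) ** 2 for xi in known_xs)
    if sxx == 0:
        raise ValueError("all x values are identical, no trend line exists")
    slope = sum(
        (xi - mean_x) * (yi - mean_y) for xi, yi in zip(known_xs, known_ys)
    ) / sxx
    # Written around the means to avoid a huge intercept when x is a year.
    return mean_y + slope * (x - mean_x)


# Hypothetical toy series, NOT the real population data.
years = [2000, 2001, 2002, 2003]
values = [10.0, 12.0, 14.0, 16.0]

# Target years of the exercise: every tenth year 2010..2060, plus 2100.
target_years = list(range(2010, 2061, 10)) + [2100]
predictions = {year: forecast(year, values, years) for year in target_years}
for year, value in predictions.items():
    print(year, value)
```

Because the line is extended with the same slope forever, nothing stops the forecast from growing without bound or turning negative far from the data, which is worth keeping in mind when judging the result for 2100.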
6. Continue with ‘forecast’ for the same years, now predicting
‘born’ and ‘dead’. What kind of light do the results shed on the future?
7. Finally, predict the population for 2005-2009 with ‘forecast’, first using the whole
population data column up to 2004 and then using only 1951-2004. Compare the
predictions with the actual values given in the population data file.
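The comparison in this last task is a small hold-out test: fit the trend on years up to a cut-off, predict the later years, and look at prediction minus actual value. A Python sketch of that bookkeeping follows. The series is a hypothetical toy example with made-up years and values, and the names fit_line and holdout_errors are invented here; with the real file the two training windows would be "everything up to 2004" and "1951-2004".

```python
def fit_line(xs, ys):
    """Least-squares line; returns (slope, mean_x, mean_y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_x, mean_y


def holdout_errors(series, train_start, train_end, test_years):
    """Fit on train_start..train_end, return {year: prediction - actual}."""
    train = [(y, v) for y, v in sorted(series.items()) if train_start <= y <= train_end]
    slope, mean_x, mean_y = fit_line([y for y, _ in train], [v for _, v in train])
    return {
        year: (mean_y + slope * (year - mean_x)) - series[year]
        for year in test_years
    }


# Hypothetical toy series {year: value}, NOT the real population data.
SERIES = {
    2000: 90.0, 2001: 100.0, 2002: 104.0, 2003: 106.0,
    2004: 108.0, 2005: 109.0, 2006: 110.0,
}

long_window = holdout_errors(SERIES, 2000, 2004, [2005, 2006])
short_window = holdout_errors(SERIES, 2002, 2004, [2005, 2006])
print("long window errors: ", long_window)
print("short window errors:", short_window)
```

In this toy series the early years rise faster than the recent ones, so the longer window overshoots more. Whether the same happens with the real population column is exactly what the exercise asks you to check.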