Data Mining - [1] Data - 03 - Preprocessing...Data Mining –Fabio Stella Data: EXPLORATION...

Data: EXPLORATION Data Mining Fabio Stella DATA EXPLORATION Fabio Stella Associate Professor c/o Department of Informatics, Systems and Communication University of Milano Bicocca


Transcription and interpretation errors are the responsibility of the lecturer.

Part of the material presented in this lecture is taken from the following book:

Pang-Ning Tan, Michael Steinbach and Vipin Kumar (2006). Introduction to Data Mining, Pearson International.


The following concepts will be introduced:

✓ SUMMARY STATISTICS

• MEAN, MODE

• QUANTILE/PERCENTILE

• RANGE, VARIANCE, STANDARD DEVIATION

• AAD, MAD, IQR

✓ VISUALIZATION

• HISTOGRAM

• BOX-AND-WHISKERS


EXPLORATION: SUMMARY STATISTICS

Your friend asks you to provide THE MOST BASIC SUMMARY OF THE DATA to gain a general picture of HOW CHURNING IS PROGRESSING.

After two days you receive an EMAIL MESSAGE FROM YOUR FRIEND with an attached txt file named churn.

You download and open the CHURN.TXT FILE AND INSPECT THE FIRST 10 LINES:

Area Code  Day Mins  Eve Mins  Churn  Int'l Plan  VMail Plan  Day Calls  Night Calls  Night Charge  Intl Calls  State  Phone
415        265,1     197,4     n      0           1           110        91           11,01         3           ?      382-4657
415        161,6     195,5     n      0           1           123        103          11,45         3           OH     371-7191
?          243,4     121,2     n      0           0           114        104          7,32          5           NJ     358-1921
408        299,4     61,9      n      ?           0           71         89           8,86          7           OH     375-9999
415        166,7     148,3     y      1           0           113        121          8,41          3           OK     330-6626
510        223,4     220,6     n      1           0           98         118          9,18          ?           AL     ?
510        218,2     348,5     n      0           1           88         118          9,57          7           MA     355-9993
415        157       103,1     n      1           0           ?          96           9,53          6           MO     329-9001
408        184,5     351,6     n      0           0           97         90           9,71          4           LA     335-4719
415        ?         222       n      1           1           84         97           14,69         5           WV     330-8173
...

You decide to first compute SUMMARY STATISTICS for all the ATTRIBUTES reported in the churn.txt data file, whenever this is meaningful.
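A minimal sketch of how such a file could be loaded, assuming (the slides do not say) that churn.txt is tab-separated, that `?` marks a missing value, and that numbers use a decimal comma. `SAMPLE`, `parse_cell` and `read_churn` are hypothetical names introduced only for illustration.

```python
import csv
import io

# Hypothetical three-row excerpt; the tab delimiter is an assumption.
SAMPLE = (
    "Area Code\tDay Mins\tChurn\tState\n"
    "415\t265,1\tn\t?\n"
    "?\t243,4\tn\tNJ\n"
    "415\t?\tn\tWV\n"
)


def parse_cell(text):
    """Map '?' to None, decimal-comma numbers to float, anything else to str."""
    text = text.strip()
    if text == "?":
        return None
    try:
        return float(text.replace(",", "."))
    except ValueError:
        return text  # qualitative value such as 'n' or 'NJ'


def read_churn(handle, delimiter="\t"):
    """Read the rows as dictionaries keyed by column name."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    return [{name: parse_cell(value) for name, value in row.items()}
            for row in reader]


rows = read_churn(io.StringIO(SAMPLE))
# Keep only the rows where DAY MINS is available.
day_mins = [row["Day Mins"] for row in rows if row["Day Mins"] is not None]
print(day_mins)
```

For simplicity the sketch also turns numeric codes such as the area code into floats; a real loader would probably keep those as labels.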


The CHURN ATTRIBUTE is QUALITATIVE and so you COMPUTE its MODE, i.e., the MOST FREQUENT VALUE IN THE DATA SET:

                     n             y             total
absolute frequency   9             1             10
relative frequency   9/10 = 0.9    1/10 = 0.1    1

The MODE of CHURN in the first 10 lines is therefore n.
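The frequency table can be reproduced with a few lines of Python. This is only a sketch: the list `churn` is typed in by hand from the Churn column of the 10 lines shown earlier, and the variable names are illustrative.

```python
from collections import Counter

# Churn column of the first 10 lines, copied by hand.
churn = ["n", "n", "n", "n", "y", "n", "n", "n", "n", "n"]

counts = Counter(churn)                      # absolute frequencies
total = sum(counts.values())
relative = {value: count / total for value, count in counts.items()}

# The mode is the value with the highest absolute frequency.
mode_value, mode_count = counts.most_common(1)[0]
print(counts, relative, mode_value)
```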


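The DAY MINS ATTRIBUTE is QUANTITATIVE and so you consider a different summary statistic, the QUANTILES of a set of values. The nine available values (one of the first 10 lines has a missing DAY MINS value) are

265.1 161.6 243.4 299.4 166.7 223.4 218.2 157.0 184.5

and, sorted in increasing order,

157.0 161.6 166.7 184.5 218.2 223.4 243.4 265.1 299.4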


Particular quantiles are then located on the sorted values: the QUANTILE OF ORDER 1/3, the QUANTILE OF ORDER 2/3 and the MEDIAN (the quantile of order 1/2).
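The slides mark these quantiles graphically and do not state the rule used to pick them. The sketch below assumes one common convention: the quantile of order p is the sorted value in position ceil(p * n). This convention is an assumption, chosen because it reproduces the 25% and 75% quantiles (166.7 and 243.4) that the slides use for the interquartile range; the function name `quantile` is illustrative.

```python
import math
import statistics
from fractions import Fraction

day_mins = [265.1, 161.6, 243.4, 299.4, 166.7, 223.4, 218.2, 157.0, 184.5]


def quantile(values, p):
    """Sorted value in 1-based position ceil(p * n) (assumed convention)."""
    p = Fraction(p)  # pass Fractions such as Fraction(1, 3) to avoid float error
    if not 0 < p <= 1:
        raise ValueError("p must lie in (0, 1]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered))
    return ordered[rank - 1]


q_one_third = quantile(day_mins, Fraction(1, 3))
q_two_thirds = quantile(day_mins, Fraction(2, 3))
median_value = quantile(day_mins, Fraction(1, 2))

# With an odd number of values the convention agrees with the usual median.
assert median_value == statistics.median(day_mins)
print(q_one_third, median_value, q_two_thirds)
```

Other conventions interpolate between neighbouring values and can return slightly different quantiles on such a small sample.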


You also compute the MEAN OF DAY MINS. Denoting the nine available values by $x_1, x_2, \dots, x_9$,

265.1 161.6 243.4 299.4 166.7 223.4 218.2 157.0 184.5

the mean is

$$ \text{mean} = \frac{1}{9}\sum_{i=1}^{9} x_i = 213.26 $$


The MEAN is sensitive to anomalous records (OUTLIERS), while the MEDIAN is a MORE ROBUST ESTIMATE OF THE MIDDLE of a set of values. In what follows, $x_{(1)} \le x_{(2)} \le \dots \le x_{(9)}$ denote the sorted values, from $x_{(1)} = 157.0$ to $x_{(9)} = 299.4$.
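The difference in robustness can be seen with a small experiment. The sketch below computes the mean and median of the nine DAY MINS values, then repeats the computation after a hypothetical recording error (299.4 typed as 2994.0); the corrupted value is invented purely for illustration.

```python
import statistics

day_mins = [265.1, 161.6, 243.4, 299.4, 166.7, 223.4, 218.2, 157.0, 184.5]

mean_value = statistics.mean(day_mins)
median_value = statistics.median(day_mins)

# Hypothetical outlier: the largest value is recorded ten times too large.
corrupted = [2994.0 if x == 299.4 else x for x in day_mins]
corrupted_mean = statistics.mean(corrupted)
corrupted_median = statistics.median(corrupted)

print(round(mean_value, 2), median_value)
print(round(corrupted_mean, 2), corrupted_median)
```

A single anomalous record moves the mean a long way, while the median stays where it was because only the order of the values matters.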


The TRIMMED MEAN is sometimes used: the smallest and largest values are discarded before averaging. Here, discarding the minimum 157.0 and the maximum 299.4 and averaging the remaining seven sorted values gives

$$ \text{trimmed mean} = \frac{1}{7}\sum_{i=2}^{8} x_{(i)} \approx 208.99 $$

to be compared with the mean of 213.26.
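A minimal sketch of a trimmed mean, assuming that trimming removes the same number `k` of values from each end of the sorted data (one from each end in the example above). The function name and the parameter `k` are illustrative, not taken from the slides.

```python
import statistics

day_mins = [265.1, 161.6, 243.4, 299.4, 166.7, 223.4, 218.2, 157.0, 184.5]


def trimmed_mean(values, k=1):
    """Mean after discarding the k smallest and the k largest values."""
    if k < 0 or 2 * k >= len(values):
        raise ValueError("k must leave at least one value")
    ordered = sorted(values)
    return statistics.mean(ordered[k:len(ordered) - k])


trimmed = trimmed_mean(day_mins, k=1)
print(round(trimmed, 2))
```

With `k=0` nothing is discarded and the function returns the ordinary mean.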


Another commonly used set of summary statistics for quantitative attributes measures the dispersion (spread) of a set of values. The simplest is the RANGE, the difference between the MAX and the MIN of the values:

$$ \text{range} = \text{MAX} - \text{MIN} = 299.4 - 157.0 = 142.4 $$
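As a quick check, the range can be computed directly from the nine DAY MINS values. The sketch rounds the result for display because binary floating point does not always reproduce decimal subtraction exactly.

```python
day_mins = [265.1, 161.6, 243.4, 299.4, 166.7, 223.4, 218.2, 157.0, 184.5]

minimum = min(day_mins)
maximum = max(day_mins)
value_range = maximum - minimum

# Round to one decimal, the precision of the data.
print(round(value_range, 1))
```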


The RANGE can be misleading if most of the values are concentrated in a narrow band while a few extreme values stretch it. The VARIANCE is preferred:

$$ \text{var} = \frac{1}{8}\sum_{i=1}^{9} (x_i - 213.26)^2 = 2{,}496.5 $$

Its square root is the STANDARD DEVIATION:

$$ \sqrt{\text{var}} = 49.96 $$
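The sketch below recomputes the variance with the same divisor as the formula (n - 1 = 8) and takes its square root. It then compares two hypothetical samples, `narrow` and `wide`, invented here to illustrate why the range alone can mislead: both have the same range, but their standard deviations differ.

```python
import math
import statistics

day_mins = [265.1, 161.6, 243.4, 299.4, 166.7, 223.4, 218.2, 157.0, 184.5]

mean_value = statistics.mean(day_mins)
# Sample variance: squared deviations summed and divided by n - 1.
variance = sum((x - mean_value) ** 2 for x in day_mins) / (len(day_mins) - 1)
std_dev = math.sqrt(variance)

# Hypothetical samples with identical range (100) but different spread.
narrow = [0, 50, 50, 50, 50, 50, 100]   # most values in a narrow band
wide = [0, 0, 0, 50, 100, 100, 100]     # values pushed to the extremes

print(round(variance, 1), round(std_dev, 2))
print(statistics.stdev(narrow), statistics.stdev(wide))
```

`statistics.variance` and `statistics.stdev` from the standard library use the same n - 1 divisor, so they can replace the explicit sum.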


The variance depends on the mean and thus it is also sensitive to outliers. More robust estimates of the spread of a set of values are the ABSOLUTE AVERAGE DEVIATION (AAD)

$$ \text{AAD} = \frac{1}{9}\sum_{i=1}^{9} |x_i - 213.26| = 40.72 $$

and the MEDIAN ABSOLUTE DEVIATION (MAD)

$$ \text{MAD} = \text{median}\left(|x_1 - 213.26|, \dots, |x_9 - 213.26|\right) = 46.56 $$
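Both measures can be sketched in a few lines. Following the formulas above, the deviations are taken from the mean (213.26); note that many references and libraries define the MAD around the median instead, which would give a different number. The function names `aad` and `mad` are illustrative.

```python
import statistics

day_mins = [265.1, 161.6, 243.4, 299.4, 166.7, 223.4, 218.2, 157.0, 184.5]


def aad(values):
    """Average of the absolute deviations from the mean."""
    center = statistics.mean(values)
    return statistics.mean([abs(x - center) for x in values])


def mad(values):
    """Median of the absolute deviations from the mean (as in the slides)."""
    center = statistics.mean(values)
    return statistics.median([abs(x - center) for x in values])


print(round(aad(day_mins), 2), round(mad(day_mins), 2))
```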


Another robust estimate of the spread is the INTERQUARTILE RANGE (IQR), the difference between the 75% QUANTILE and the 25% QUANTILE of the sorted values:

$$ \text{IQR} = x_{75\%} - x_{25\%} = 243.4 - 166.7 = 76.7 $$

Page 27: Data Mining - [1] Data - 03 - Preprocessing...Data Mining –Fabio Stella Data: EXPLORATION Transcription and interpretation errors are responsibility of the lecturer. Pang-Ning Tan,

Data: EXPLORATIONData Mining – Fabio Stella

1

When multiple Quantitative Attributes are available, you usually compute the VARIANCE-COVARIANCE MATRIX, which is built from the covariance of each pair of attributes X and Y:

$\mathrm{cov}(X, Y) = \frac{1}{m-1} \sum_{i=1}^{m} (x_i - \bar{x})(y_i - \bar{y})$

where m is the number of records and $\bar{x}$, $\bar{y}$ are the means of the two attributes.


The VARIANCE-COVARIANCE MATRIX is square and symmetric; its “ij” element is the covariance between the ith attribute and the jth attribute, so its diagonal holds the variance of each attribute.
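The properties of the matrix can be checked with a short Python sketch. It assumes the 1/(m-1) covariance and uses records 1 to 9 of the churn.txt excerpt, for which Day Mins, Eve Mins and Night Charge are all present. The function names `cov` and `cov_matrix` are illustrative, not taken from the lecture.

```python
from statistics import mean, variance

def cov(xs, ys):
    """Sample covariance of two attributes, with the 1/(m-1) normalisation."""
    m = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (m - 1)

def cov_matrix(columns):
    """Matrix whose (i, j) element is cov(columns[i], columns[j])."""
    return [[cov(ci, cj) for cj in columns] for ci in columns]

# Records 1-9 of the excerpt (record 10 has a missing Day Mins value).
day_mins = [265.1, 161.6, 243.4, 299.4, 166.7, 223.4, 218.2, 157.0, 184.5]
eve_mins = [197.4, 195.5, 121.2, 61.9, 148.3, 220.6, 348.5, 103.1, 351.6]
night_charge = [11.01, 11.45, 7.32, 8.86, 8.41, 9.18, 9.57, 9.53, 9.71]

S = cov_matrix([day_mins, eve_mins, night_charge])

# Diagonal elements are the variances of the single attributes.
diagonal_ok = abs(S[0][0] - variance(day_mins)) < 1e-6

for row in S:
    print("  ".join(f"{v:10.2f}" for v in row))
```

Dropping incomplete records before computing the matrix is only one way to handle missing values; it is used here to keep the example small.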


Another measure of association between pairs of Quantitative Attributes, which does not depend on the variance of each attribute, is the LINEAR CORRELATION COEFFICIENT

$\mathrm{corr}(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{var}(x)\,\mathrm{var}(y)}}$


The LINEAR CORRELATION COEFFICIENT ranges in [-1, +1]; the greater its absolute value, the stronger the linear relationship between the two attributes.
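A small Python sketch can illustrate both the [-1, +1] range and the independence from the scale of each attribute. It assumes the 1/(m-1) covariance and reuses Day Mins and Eve Mins from records 1 to 9 of the excerpt; the helper names are illustrative.

```python
from math import sqrt
from statistics import mean

def cov(xs, ys):
    """Sample covariance with the 1/(m-1) normalisation."""
    m = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (m - 1)

def corr(xs, ys):
    """Linear correlation: covariance divided by the product of the standard deviations."""
    # cov(xs, xs) is the variance of xs.
    return cov(xs, ys) / sqrt(cov(xs, xs) * cov(ys, ys))

day_mins = [265.1, 161.6, 243.4, 299.4, 166.7, 223.4, 218.2, 157.0, 184.5]
eve_mins = [197.4, 195.5, 121.2, 61.9, 148.3, 220.6, 348.5, 103.1, 351.6]

r = corr(day_mins, eve_mins)

# Changing the unit of one attribute changes its variance and the covariance,
# but leaves the correlation unchanged.
day_secs = [60 * x for x in day_mins]
r_rescaled = corr(day_secs, eve_mins)

print(f"corr(Day Mins, Eve Mins) = {r:.3f}")
print(f"after rescaling Day Mins = {r_rescaled:.3f}")
```

An attribute correlated with itself gives +1, and with its negation gives -1, which are the two extremes of the range.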


You also make the decision to exploit DATA VISUALIZATION, i.e., to display information in a

graphic or tabular format.

EXPLORATION: VISUALIZATION

HISTOGRAM, a plot that DISPLAYS THE DISTRIBUTION OF VALUES FOR ATTRIBUTES by DIVIDING

THE POSSIBLE VALUES INTO BINS and SHOWING THE NUMBER OF RECORDS THAT FALL INTO EACH

BIN.

HISTOGRAM, for a QUANTITATIVE ATTRIBUTE each BIN IS AN INTERVAL OF VALUES; bins may or may not have the same width.
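The binning step behind a histogram of a quantitative attribute can be sketched in a few lines of Python. The sketch assumes equal-width bins spanning the observed minimum and maximum and requires that the values are not all equal; the function name `histogram` and the choice of three bins are illustrative.

```python
def histogram(values, n_bins):
    """Count how many values fall into each of n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # assumes hi > lo
    counts = [0] * n_bins
    for v in values:
        # The maximum would fall just past the last interval, so clamp it.
        k = min(int((v - lo) / width), n_bins - 1)
        counts[k] += 1
    edges = [lo + k * width for k in range(n_bins + 1)]
    return edges, counts

# Day Mins values from the churn.txt excerpt; the missing entry is dropped.
day_mins = [265.1, 161.6, 243.4, 299.4, 166.7, 223.4, 218.2, 157.0, 184.5]

edges, counts = histogram(day_mins, n_bins=3)
for k, c in enumerate(counts):
    print(f"{edges[k]:6.1f} to {edges[k + 1]:6.1f}  {'#' * c}")
```

Bins of different widths would only require passing explicit edges instead of computing them from `n_bins`.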


HISTOGRAM, for a QUALITATIVE ATTRIBUTE each BIN IS ASSOCIATED WITH A VALUE; when there are too many values, they are aggregated to form meaningful bins where possible.
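For a qualitative attribute the bins are simply value counts. The Python sketch below counts the State values of the churn.txt excerpt and then merges rarely seen values into a single bin. The frequency threshold and the `OTHER` label are assumptions made for illustration; the slides do not prescribe an aggregation rule, and a grouping driven by domain knowledge (for example by region) may be more meaningful.

```python
from collections import Counter

# State values from the churn.txt excerpt; "?" marks a missing value.
states = ["?", "OH", "NJ", "OH", "OK", "AL", "MA", "MO", "LA", "WV"]

# One bin per distinct value, missing values skipped.
counts = Counter(s for s in states if s != "?")

def aggregate_rare(value_counts, min_count, other_label="OTHER"):
    """Merge values seen fewer than min_count times into one bin (hypothetical rule)."""
    merged = Counter()
    for value, c in value_counts.items():
        merged[value if c >= min_count else other_label] += c
    return merged

bins = aggregate_rare(counts, min_count=2)
for value, c in bins.most_common():
    print(f"{value:6s} {'#' * c}")
```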


BOX AND WHISKERS plots (BOX PLOTS) are applied to QUANTITATIVE ATTRIBUTES ONLY.


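Figure (box plot): the box spans from qL = q25 to qU = q75, with a line at the median = q50; Dq = q75 - q25.

Lower whisker: the smallest value that is not an outlier, i.e., that is greater than qL - 1.5 Dq.

Upper whisker: the greatest value that is not an outlier, i.e., that is smaller than qU + 1.5 Dq.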
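The quantities drawn in the box plot can be computed with a minimal Python sketch that follows the 1.5 Dq rule. The quantile convention (`method="inclusive"`) is an assumption, the function name `box_plot_stats` is illustrative, and the value 600.0 in the second call is hypothetical, added only to show an outlier.

```python
import statistics

def box_plot_stats(values):
    """Quartiles, whisker ends and outliers according to the 1.5 * Dq rule."""
    q_l, median, q_u = statistics.quantiles(values, n=4, method="inclusive")
    d_q = q_u - q_l
    low_fence = q_l - 1.5 * d_q
    high_fence = q_u + 1.5 * d_q
    # Values strictly inside the fences are not outliers.
    inside = [v for v in values if low_fence < v < high_fence]
    outliers = sorted(v for v in values if not (low_fence < v < high_fence))
    return {
        "qL": q_l, "median": median, "qU": q_u, "Dq": d_q,
        "lower_whisker": min(inside),
        "upper_whisker": max(inside),
        "outliers": outliers,
    }

# Day Mins values from the churn.txt excerpt; the missing entry is dropped.
day_mins = [265.1, 161.6, 243.4, 299.4, 166.7, 223.4, 218.2, 157.0, 184.5]

stats = box_plot_stats(day_mins)

# Same data plus one hypothetical extreme value (not in the original file).
stats_with_extra = box_plot_stats(day_mins + [600.0])

print(stats)
print(stats_with_extra["outliers"])
```

On the excerpt alone no value lies outside the fences, so the whiskers coincide with the minimum and maximum of the data.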


Alternatively, the whiskers extend to the minimum value and the maximum value in the data.


As a further alternative, the whiskers extend to the 10th percentile and the 90th percentile.