Data Analysis 3 - Feature Selection, Representative-based ...pla06/files/mad3/mad3_02.pdf ·...
Transcript of Data Analysis 3 - Feature Selection, Representative-based ...pla06/files/mad3/mad3_02.pdf ·...
![Page 1: Data Analysis 3 - Feature Selection, Representative-based ...pla06/files/mad3/mad3_02.pdf · DataAnalysis3 FeatureSelection,Representative-basedClustering,HierarchicalClustering JanPlatoš](https://reader034.fdocuments.us/reader034/viewer/2022050308/5f7050d0e8c3ea15a658d1dd/html5/thumbnails/1.jpg)
Data Analysis 3Explorative Analysis
Jan PlatošSeptember 20, 2020
Department of Computer ScienceFaculty of Electrical Engineering and Computer ScienceVŠB - Technical University of Ostrava
![Page 2: Data Analysis 3 - Feature Selection, Representative-based ...pla06/files/mad3/mad3_02.pdf · DataAnalysis3 FeatureSelection,Representative-basedClustering,HierarchicalClustering JanPlatoš](https://reader034.fdocuments.us/reader034/viewer/2022050308/5f7050d0e8c3ea15a658d1dd/html5/thumbnails/2.jpg)
Table of contents
1. Explorative Analysis
Processing of the Data
Feature types
Relationship between features
2. Feature rescaling
3. Categorial Data Encoding
1
![Page 3: Data Analysis 3 - Feature Selection, Representative-based ...pla06/files/mad3/mad3_02.pdf · DataAnalysis3 FeatureSelection,Representative-basedClustering,HierarchicalClustering JanPlatoš](https://reader034.fdocuments.us/reader034/viewer/2022050308/5f7050d0e8c3ea15a658d1dd/html5/thumbnails/3.jpg)
Explorative Analysis
![Page 4: Data Analysis 3 - Feature Selection, Representative-based ...pla06/files/mad3/mad3_02.pdf · DataAnalysis3 FeatureSelection,Representative-basedClustering,HierarchicalClustering JanPlatoš](https://reader034.fdocuments.us/reader034/viewer/2022050308/5f7050d0e8c3ea15a658d1dd/html5/thumbnails/4.jpg)
Explorative Analysis
• What do you imagine?
• What is the shape of the data?
• What is the distribution of the data?
• What is the distribution amoung the features?
• Is there any relation between features?
2
![Page 5: Data Analysis 3 - Feature Selection, Representative-based ...pla06/files/mad3/mad3_02.pdf · DataAnalysis3 FeatureSelection,Representative-basedClustering,HierarchicalClustering JanPlatoš](https://reader034.fdocuments.us/reader034/viewer/2022050308/5f7050d0e8c3ea15a658d1dd/html5/thumbnails/5.jpg)
Explorative Analysis
• What do you imagine?
• What is the shape of the data?
• What is the distribution of the data?
• What is the distribution amoung the features?
• Is there any relation between features?
2
![Page 6: Data Analysis 3 - Feature Selection, Representative-based ...pla06/files/mad3/mad3_02.pdf · DataAnalysis3 FeatureSelection,Representative-basedClustering,HierarchicalClustering JanPlatoš](https://reader034.fdocuments.us/reader034/viewer/2022050308/5f7050d0e8c3ea15a658d1dd/html5/thumbnails/6.jpg)
Explorative Analysis
• What do you imagine?
• What is the shape of the data?
• What is the distribution of the data?
• What is the distribution amoung the features?
• Is there any relation between features?
2
![Page 7: Data Analysis 3 - Feature Selection, Representative-based ...pla06/files/mad3/mad3_02.pdf · DataAnalysis3 FeatureSelection,Representative-basedClustering,HierarchicalClustering JanPlatoš](https://reader034.fdocuments.us/reader034/viewer/2022050308/5f7050d0e8c3ea15a658d1dd/html5/thumbnails/7.jpg)
Explorative Analysis
• What do you imagine?
• What is the shape of the data?
• What is the distribution of the data?
• What is the distribution amoung the features?
• Is there any relation between features?
2
![Page 8: Data Analysis 3 - Feature Selection, Representative-based ...pla06/files/mad3/mad3_02.pdf · DataAnalysis3 FeatureSelection,Representative-basedClustering,HierarchicalClustering JanPlatoš](https://reader034.fdocuments.us/reader034/viewer/2022050308/5f7050d0e8c3ea15a658d1dd/html5/thumbnails/8.jpg)
Explorative Analysis
• What do you imagine?
• What is the shape of the data?
• What is the distribution of the data?
• What is the distribution amoung the features?
• Is there any relation between features?
2
![Page 9: Data Analysis 3 - Feature Selection, Representative-based ...pla06/files/mad3/mad3_02.pdf · DataAnalysis3 FeatureSelection,Representative-basedClustering,HierarchicalClustering JanPlatoš](https://reader034.fdocuments.us/reader034/viewer/2022050308/5f7050d0e8c3ea15a658d1dd/html5/thumbnails/9.jpg)
Explorative Analysis
Exploratory Data Analysis refers to the critical process of performing initial
investigations on data so as to discover patterns, to spot anomalies, to
test hypothesis and to check assumptions with the help of summary
statistics and graphical representations.1
1https://towardsdatascience.com/exploratory-data-analysis-8fc1cb20fd15
3
![Page 10: Data Analysis 3 - Feature Selection, Representative-based ...pla06/files/mad3/mad3_02.pdf · DataAnalysis3 FeatureSelection,Representative-basedClustering,HierarchicalClustering JanPlatoš](https://reader034.fdocuments.us/reader034/viewer/2022050308/5f7050d0e8c3ea15a658d1dd/html5/thumbnails/10.jpg)
Processing of the Data
• Dataset investigations
• How many features it has?
• How many records it has?
• What is the content of the features?
• What type of the features it has - Categorial, Numeric, Text?
4
![Page 11: Data Analysis 3 - Feature Selection, Representative-based ...pla06/files/mad3/mad3_02.pdf · DataAnalysis3 FeatureSelection,Representative-basedClustering,HierarchicalClustering JanPlatoš](https://reader034.fdocuments.us/reader034/viewer/2022050308/5f7050d0e8c3ea15a658d1dd/html5/thumbnails/11.jpg)
Feature types - Categorial features
• How many distinct features it has?
• Is it a class label or regular feature?
• What is the distribution of the values?
5
![Page 12: Data Analysis 3 - Feature Selection, Representative-based ...pla06/files/mad3/mad3_02.pdf · DataAnalysis3 FeatureSelection,Representative-basedClustering,HierarchicalClustering JanPlatoš](https://reader034.fdocuments.us/reader034/viewer/2022050308/5f7050d0e8c3ea15a658d1dd/html5/thumbnails/12.jpg)
Feature types - Numeric features
• What is the minimum and maximum?
• What are the quartiles?
• What is the distribution/histogram of the data?
• What are the properties of the data distribution?
• Mean x =∑n
i=1 xin
• Median - a middle value of sorted values.
• Variance s2 =∑n
i=1(xi−x)2
n−1
• Std. deviation σ =√s2
• Inter-quartile range IQR = Q3 − Q1
6
![Page 13: Data Analysis 3 - Feature Selection, Representative-based ...pla06/files/mad3/mad3_02.pdf · DataAnalysis3 FeatureSelection,Representative-basedClustering,HierarchicalClustering JanPlatoš](https://reader034.fdocuments.us/reader034/viewer/2022050308/5f7050d0e8c3ea15a658d1dd/html5/thumbnails/13.jpg)
Feature types - Graphical analysis
• Histograms2
2https://en.wikipedia.org/wiki/Histogram
7
![Page 14: Data Analysis 3 - Feature Selection, Representative-based ...pla06/files/mad3/mad3_02.pdf · DataAnalysis3 FeatureSelection,Representative-basedClustering,HierarchicalClustering JanPlatoš](https://reader034.fdocuments.us/reader034/viewer/2022050308/5f7050d0e8c3ea15a658d1dd/html5/thumbnails/14.jpg)
Feature types - Graphical analysis
• Histograms2
2https://en.wikipedia.org/wiki/Histogram
7
![Page 15: Data Analysis 3 - Feature Selection, Representative-based ...pla06/files/mad3/mad3_02.pdf · DataAnalysis3 FeatureSelection,Representative-basedClustering,HierarchicalClustering JanPlatoš](https://reader034.fdocuments.us/reader034/viewer/2022050308/5f7050d0e8c3ea15a658d1dd/html5/thumbnails/15.jpg)
Feature types - Graphical analysis
• Histograms2
2https://en.wikipedia.org/wiki/Histogram
7
![Page 16: Data Analysis 3 - Feature Selection, Representative-based ...pla06/files/mad3/mad3_02.pdf · DataAnalysis3 FeatureSelection,Representative-basedClustering,HierarchicalClustering JanPlatoš](https://reader034.fdocuments.us/reader034/viewer/2022050308/5f7050d0e8c3ea15a658d1dd/html5/thumbnails/16.jpg)
Feature types - Graphical analysis
• Histograms2
2https://en.wikipedia.org/wiki/Histogram
7
![Page 17: Data Analysis 3 - Feature Selection, Representative-based ...pla06/files/mad3/mad3_02.pdf · DataAnalysis3 FeatureSelection,Representative-basedClustering,HierarchicalClustering JanPlatoš](https://reader034.fdocuments.us/reader034/viewer/2022050308/5f7050d0e8c3ea15a658d1dd/html5/thumbnails/17.jpg)
Feature types - Graphical analysis
• Histograms2
2https://en.wikipedia.org/wiki/Histogram
7
![Page 18: Data Analysis 3 - Feature Selection, Representative-based ...pla06/files/mad3/mad3_02.pdf · DataAnalysis3 FeatureSelection,Representative-basedClustering,HierarchicalClustering JanPlatoš](https://reader034.fdocuments.us/reader034/viewer/2022050308/5f7050d0e8c3ea15a658d1dd/html5/thumbnails/18.jpg)
Feature types - Graphical analysis
• Histograms2
2https://en.wikipedia.org/wiki/Histogram
7
![Page 19: Data Analysis 3 - Feature Selection, Representative-based ...pla06/files/mad3/mad3_02.pdf · DataAnalysis3 FeatureSelection,Representative-basedClustering,HierarchicalClustering JanPlatoš](https://reader034.fdocuments.us/reader034/viewer/2022050308/5f7050d0e8c3ea15a658d1dd/html5/thumbnails/19.jpg)
Feature types - Graphical analysis
• Histograms
• Ideal for categorial data
• Numeric values requires another parameter - Bin size
8
![Page 20: Data Analysis 3 - Feature Selection, Representative-based ...pla06/files/mad3/mad3_02.pdf · DataAnalysis3 FeatureSelection,Representative-basedClustering,HierarchicalClustering JanPlatoš](https://reader034.fdocuments.us/reader034/viewer/2022050308/5f7050d0e8c3ea15a658d1dd/html5/thumbnails/20.jpg)
Feature types - Graphical analysis
Box plot (Box and Whiskers plot)• Minimum : the lowest data pointexcluding any outliers.
• Maximum : the largest data pointexcluding any outliers.
• Median : the middle value of thedataset.
• First quartile : the median of the lowerhalf of the dataset.
• Third quartile : the median of the upperhalf of the dataset.
9
![Page 21: Data Analysis 3 - Feature Selection, Representative-based ...pla06/files/mad3/mad3_02.pdf · DataAnalysis3 FeatureSelection,Representative-basedClustering,HierarchicalClustering JanPlatoš](https://reader034.fdocuments.us/reader034/viewer/2022050308/5f7050d0e8c3ea15a658d1dd/html5/thumbnails/21.jpg)
Feature types - Graphical analysis
• Violin plot3
3https://towardsdatascience.com/violin-plots-explained-fb1d115e023d
10
![Page 22: Data Analysis 3 - Feature Selection, Representative-based ...pla06/files/mad3/mad3_02.pdf · DataAnalysis3 FeatureSelection,Representative-basedClustering,HierarchicalClustering JanPlatoš](https://reader034.fdocuments.us/reader034/viewer/2022050308/5f7050d0e8c3ea15a658d1dd/html5/thumbnails/22.jpg)
Relationship between features
• There are many ways how to identify the relationships between fetures.• Covariance - how much (and in what direction) should we expect one variable tochange when the other changes.
Cov(X, Y) =∑n
i−1(xi − x)(yi − y)n− 1
• Correlation - similarly to covariance, the power and directon of the relation.
Cor(X, Y) = Cov(X, Y)σxσy
11
![Page 23: Data Analysis 3 - Feature Selection, Representative-based ...pla06/files/mad3/mad3_02.pdf · DataAnalysis3 FeatureSelection,Representative-basedClustering,HierarchicalClustering JanPlatoš](https://reader034.fdocuments.us/reader034/viewer/2022050308/5f7050d0e8c3ea15a658d1dd/html5/thumbnails/23.jpg)
Relationship between features - Correlation Heatmap
12
![Page 24: Data Analysis 3 - Feature Selection, Representative-based ...pla06/files/mad3/mad3_02.pdf · DataAnalysis3 FeatureSelection,Representative-basedClustering,HierarchicalClustering JanPlatoš](https://reader034.fdocuments.us/reader034/viewer/2022050308/5f7050d0e8c3ea15a658d1dd/html5/thumbnails/24.jpg)
Relationship between features - Correlation Heatmap
13
![Page 25: Data Analysis 3 - Feature Selection, Representative-based ...pla06/files/mad3/mad3_02.pdf · DataAnalysis3 FeatureSelection,Representative-basedClustering,HierarchicalClustering JanPlatoš](https://reader034.fdocuments.us/reader034/viewer/2022050308/5f7050d0e8c3ea15a658d1dd/html5/thumbnails/25.jpg)
Relationship between features - Box plot
sepal_length sepal_width petal_length petal_width0
1
2
3
4
5
6
7
8
14
![Page 26: Data Analysis 3 - Feature Selection, Representative-based ...pla06/files/mad3/mad3_02.pdf · DataAnalysis3 FeatureSelection,Representative-basedClustering,HierarchicalClustering JanPlatoš](https://reader034.fdocuments.us/reader034/viewer/2022050308/5f7050d0e8c3ea15a658d1dd/html5/thumbnails/26.jpg)
Relationship between features - Violin plot
sepal_length sepal_width petal_length petal_width
0
2
4
6
8
15
![Page 27: Data Analysis 3 - Feature Selection, Representative-based ...pla06/files/mad3/mad3_02.pdf · DataAnalysis3 FeatureSelection,Representative-basedClustering,HierarchicalClustering JanPlatoš](https://reader034.fdocuments.us/reader034/viewer/2022050308/5f7050d0e8c3ea15a658d1dd/html5/thumbnails/27.jpg)
Relationship between features - Pair plot
5
6
7
8
sepa
l_len
gth
2.0
2.5
3.0
3.5
4.0
4.5
sepa
l_wid
th
1
2
3
4
5
6
7
peta
l_len
gth
4 6 8sepal_length
0.0
0.5
1.0
1.5
2.0
2.5
peta
l_wid
th
2 3 4 5sepal_width
2 4 6 8petal_length
0 1 2 3petal_width
speciessetosaversicolorvirginica
16
![Page 28: Data Analysis 3 - Feature Selection, Representative-based ...pla06/files/mad3/mad3_02.pdf · DataAnalysis3 FeatureSelection,Representative-basedClustering,HierarchicalClustering JanPlatoš](https://reader034.fdocuments.us/reader034/viewer/2022050308/5f7050d0e8c3ea15a658d1dd/html5/thumbnails/28.jpg)
Relationship between features - Strip plot
sepal_length sepal_width petal_length petal_widthmeasurement
0
1
2
3
4
5
6
7
8
valu
e
speciessetosaversicolorvirginica
17
![Page 29: Data Analysis 3 - Feature Selection, Representative-based ...pla06/files/mad3/mad3_02.pdf · DataAnalysis3 FeatureSelection,Representative-basedClustering,HierarchicalClustering JanPlatoš](https://reader034.fdocuments.us/reader034/viewer/2022050308/5f7050d0e8c3ea15a658d1dd/html5/thumbnails/29.jpg)
Relationship between features - Swarm plot
sepal_length sepal_width petal_length petal_widthmeasurement
0
1
2
3
4
5
6
7
8
valu
e
speciessetosaversicolorvirginica
18
![Page 30: Data Analysis 3 - Feature Selection, Representative-based ...pla06/files/mad3/mad3_02.pdf · DataAnalysis3 FeatureSelection,Representative-basedClustering,HierarchicalClustering JanPlatoš](https://reader034.fdocuments.us/reader034/viewer/2022050308/5f7050d0e8c3ea15a658d1dd/html5/thumbnails/30.jpg)
Feature rescaling
![Page 31: Data Analysis 3 - Feature Selection, Representative-based ...pla06/files/mad3/mad3_02.pdf · DataAnalysis3 FeatureSelection,Representative-basedClustering,HierarchicalClustering JanPlatoš](https://reader034.fdocuments.us/reader034/viewer/2022050308/5f7050d0e8c3ea15a658d1dd/html5/thumbnails/31.jpg)
Feature rescaling
• Feature scaling is a method used to normalize the range ofindependent variables or features of data.
• The process is known also as data normalization.
• The scaling is necessary to compare features between each other anduse a metric to compute non-biased distance.
• The scaling have to respect the nature of the data.
19
![Page 32: Data Analysis 3 - Feature Selection, Representative-based ...pla06/files/mad3/mad3_02.pdf · DataAnalysis3 FeatureSelection,Representative-basedClustering,HierarchicalClustering JanPlatoš](https://reader034.fdocuments.us/reader034/viewer/2022050308/5f7050d0e8c3ea15a658d1dd/html5/thumbnails/32.jpg)
Min-Max Scaling
• The most common scaling principle.
• Unifies the min and max among the features.
• Normalize into range [0, 1]
x′ = x−min(x)max(x)−min(x)
• Normalize into range [a,b]
x′ = a+ (x−min(x)) (b− a)max(x)−min(x)
• Maximum Absolute Scaling where we normalize to |max (x)|
20
![Page 33: Data Analysis 3 - Feature Selection, Representative-based ...pla06/files/mad3/mad3_02.pdf · DataAnalysis3 FeatureSelection,Representative-basedClustering,HierarchicalClustering JanPlatoš](https://reader034.fdocuments.us/reader034/viewer/2022050308/5f7050d0e8c3ea15a658d1dd/html5/thumbnails/33.jpg)
Mean Scaling
• Moves the mean of the data into 0.
• The final range is [−1,+1].
x′ = x− xmax(x)−min(x)
21
![Page 34: Data Analysis 3 - Feature Selection, Representative-based ...pla06/files/mad3/mad3_02.pdf · DataAnalysis3 FeatureSelection,Representative-basedClustering,HierarchicalClustering JanPlatoš](https://reader034.fdocuments.us/reader034/viewer/2022050308/5f7050d0e8c3ea15a658d1dd/html5/thumbnails/34.jpg)
Standardizing
• Normalize the values to zero mean and unit variance.
x′ = x− xσ
4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0sepal_length
0
5
10
15
20
25
Coun
t
2 1 0 1 2sepal_length
0
5
10
15
20
25
30
Coun
t22
![Page 35: Data Analysis 3 - Feature Selection, Representative-based ...pla06/files/mad3/mad3_02.pdf · DataAnalysis3 FeatureSelection,Representative-basedClustering,HierarchicalClustering JanPlatoš](https://reader034.fdocuments.us/reader034/viewer/2022050308/5f7050d0e8c3ea15a658d1dd/html5/thumbnails/35.jpg)
Non-linear transform
• Remapping features distribution into form close to Normal distribution
• Box-Cox transform or Yeo-Johnson transform
x(λ) ={ xλi −1
λif λ ̸= 0
ln(xi) if λ = 0
23
![Page 36: Data Analysis 3 - Feature Selection, Representative-based ...pla06/files/mad3/mad3_02.pdf · DataAnalysis3 FeatureSelection,Representative-basedClustering,HierarchicalClustering JanPlatoš](https://reader034.fdocuments.us/reader034/viewer/2022050308/5f7050d0e8c3ea15a658d1dd/html5/thumbnails/36.jpg)
Categorial Data Encoding
![Page 37: Data Analysis 3 - Feature Selection, Representative-based ...pla06/files/mad3/mad3_02.pdf · DataAnalysis3 FeatureSelection,Representative-basedClustering,HierarchicalClustering JanPlatoš](https://reader034.fdocuments.us/reader034/viewer/2022050308/5f7050d0e8c3ea15a658d1dd/html5/thumbnails/37.jpg)
Categorial Data Encoding
• Ordinal encoding.
• One-hot encoding.
• Embedding (mainly for words)
• Word2Vec
• Glove
• …
24
![Page 38: Data Analysis 3 - Feature Selection, Representative-based ...pla06/files/mad3/mad3_02.pdf · DataAnalysis3 FeatureSelection,Representative-basedClustering,HierarchicalClustering JanPlatoš](https://reader034.fdocuments.us/reader034/viewer/2022050308/5f7050d0e8c3ea15a658d1dd/html5/thumbnails/38.jpg)
Questions?