Seminar in Electronic Government - unifr.ch · PDF file3.1.2 Fuzzy C-means Algorithm ... The...
Transcript of Seminar in Electronic Government - unifr.ch · PDF file3.1.2 Fuzzy C-means Algorithm ... The...
Seminar in Electronic Government
University of Fribourg, Department of Informatics
Case Study
Analysis of matching voters’ and candidates’ preferences applying two VAA-matching algorithms: A case study based on Peruvian Presidential Elections 2011.
STUDENT NAMES: José A. Mancera, Philipp Bosshard STUDENT NUMBERS: 10-801-207, 06-200-844 COURSE NAME: Electronic Government
DEPARTMENT: Department of Informatics
SUPERVISOR: ASSISTANT: DATE OF SUBMISSION:
Prof. Dr. Andreas Meier Luis Terán 11-29-2015
II
Table of contents
List of Figures ............................................................................................................ IV
1. Introduction ............................................................................................................. 1
1.1 Problem statement ............................................................................................ 1
1.2 Research Objectives and Methodology ............................................................. 1
1.2.1 Research Questions ................................................................................... 1
1.2.2 Objectives and Output of the thesis ............................................................ 1
1.2.3 Research Methodology ............................................................................... 2
1.3 Timetable .......................................................................................................... 2
1.4 Addressees ....................................................................................................... 2
2. Voting Advice Applications (VAA) ........................................................................... 3
2.1 Basic Definition ................................................................................................. 3
2.1 High dimensional models .................................................................................. 3
2.2 Low dimensional models ................................................................................... 4
3. Voting Advice Application Algorithms ..................................................................... 4
3.1 Types of different Algorithms ............................................................................ 4
3.1.1 Euclidean distance ..................................................................................... 4
3.1.2 Fuzzy C-means Algorithm .......................................................................... 4
4. Evaluation of VAA Algorithms ................................................................................. 8
4.1 Datasets for the cases ...................................................................................... 8
4.1.1 Peru Presidential voters’ answers .............................................................. 8
4.1.2 Peru Presidential candidates’ answers ....................................................... 9
4.2 Datasets Pre-Processing ................................................................................ 10
4.2.1 Voter’s Vector ........................................................................................... 10
4.2.2 Candidates’ Vector ................................................................................... 12
4.2.3 Principal Component Analysis .................................................................. 13
4.2.4 Cleaning of the original Dataset ................................................................ 14
4.2.5 Vote Intention ........................................................................................... 15
III
4.3 Fuzzy C-means Algorithm ............................................................................... 16
4.3.1 FCM (Random Approach) ........................................................................ 16
4.3.2 FCM (Candidates Approach) .................................................................... 16
4.3.3 FCM (Mean Voter Approach) ................................................................... 17
4.3.4 The FCM Process ..................................................................................... 19
4.4 VAA Algorithms Results and Analysis ............................................................. 20
4.4.1 Euclidean Distance ................................................................................... 20
4.4.2 FCM (Random Approach) ........................................................................ 22
4.4.3 FCM (Candidates Approach) .................................................................... 24
4.4.4 FCM (Mean User Approach)..................................................................... 25
4.4.5 Final Candidates Distances Matrix ........................................................... 27
4.5 Degrees of Membership .................................................................................. 27
4.5.1 Candidates Approach ............................................................................... 27
4.5.2 Mean User Approach ................................................................................ 28
5. Recommendations ................................................................................................ 29
6. Conclusion ............................................................................................................ 30
7. Future Work .......................................................................................................... 31
8. References ........................................................................................................... 33
Appendix I: The 5 Candidates’ Vectors for the 30 Issue questions ........................... 34
Appendix II: Matlab Code for the Project .................................................................. 35
IV
List of Figures
Figure 1: Peru Presidential 2011, Issue Questions ..................................................... 9
Figure 2: Extracting 30 dimensions (questions) from the Raw Dataset .................... 10
Figure 3: Vote Intention based on Superior Question 5 ............................................ 11
Figure 4: Peru Presidential candidates and their respective Party groups ............... 12
Figure 5: Cleaning of the Dataset ............................................................................. 14
Figure 6: PCA Dimensionality Reduction in 2D ........................................................ 15
Figure 7: PCA Dimensionality Reduction in 3D ........................................................ 16
Figure 8: Process to compute the Final Center based on candidates ...................... 17
Figure 9: Process to compute the Final Centers of the candidates based on users . 18
Figure 10: The FCM Process .................................................................................... 19
Figure 11: The Process of the Euclidean Distance ................................................... 20
Figure 12: Euclidean Distance Results ..................................................................... 21
Figure 13: Bar Chart of the Vote Intention ................................................................ 21
Figure 14: Random Final Centers ............................................................................. 23
Figure 15: Voters classified by Random Final Centers ............................................. 23
Figure 16: Final Candidates’ Centers ....................................................................... 24
Figure 17: Voters classified by Candidates' Final Centers ........................................ 24
Figure 18: Final Users' Centers ................................................................................ 25
Figure 19: Voters classified by Mean User Final Centers ......................................... 26
Figure 20: Final Candidates Distances Matrix .......................................................... 27
Figure 21: Degrees of Membership by Candidates (User 3)..................................... 28
Figure 22: Degrees of Membership by Mean Users (User 3) ................................... 28
Figure 23: Proposed Fuzzy Recommender Algorithm for VAAs ............................... 31
1
1. Introduction
1.1 Problem statement
Overview
The relationship between citizens and voters has always been in constant interaction
since the creation of governments. On the one hand, the identification with a certain
political party or candidate is difficult for the user. On the other hand, the political parties
can easily lose the visibility of the voters’ preferences, so that their strategies do not
reach to satisfy the needs and wants of the society.
The goal of this seminar thesis is to compare two VAA Algorithms and to evaluate their
accuracy. The aim of this comparison is to conduct an analysis in order to find potential
correlations between voter’s preferences and candidates from political parties.
1.2 Research Objectives and Methodology
1.2.1 Research Questions
The next group of questions is the guideline of our study, each of them is answered in
sequence during the evolution of the document.
1. Which kind of VAA matching algorithms exist?
2. Which algorithms fit most to correlate the voter’s preferences with the
candidates’ proposals?
3. What are the main differences in the results of the algorithms in terms of
(prediction) accuracy?
4. What are potential improvements for VAA’s and recommendations in order to
get a more complete analysis?
1.2.2 Objectives and Output of the thesis
The aim of this seminar thesis is to compare two VAA Algorithms by applying them to
a specific dataset and evaluate their accuracy. The test of these algorithms will help to
analyze correlations between voters’ preferences and political parties/candidates’
proposals. The results of the correlation analysis may imply differences among the
particular VAA algorithms.
2
1.2.3 Research Methodology
In a first step, selected textbooks, previous research papers and similar cases will be
taken into account in order to get an overview of the theoretical framework. The second
step of this thesis will consider the application, analysis and comparison of two
matching algorithms, based on the dataset of Peruvian Presidential Elections 2011.
The datasets are provided by preferencematcher.org.
1.3 Timetable
10-07-2015 Acceptance of working title
10-11-2015 Submission of the proposal
October 2015 Continue literature Research and
reading.
Writing Chapter 1, 2 and 3
November 2015 Applying the algorithms using the
datasets provided.
Writing Chapter 4, 5, 6 and 7
Draft of the paper
Finishing the report
Revision and Correction
11-02-2015 Midterm Appointment
11-29-2015 Submission of the thesis report
12-04-2015 Presentation of the thesis report
1.4 Addressees
The target audience of thesis is primarily students in the fields of computer science,
marketing, political sciences and professionals who are involved in the field of Voting
Advice Applications. The results of this seminar document should provide the parties
mentioned above not only valuable knowledge in order to better understand, analyze
and improve the quality of Voting Advice Applications, but also a better understanding
of the consequences of the algorithms on voters and parties.
3
2. Voting Advice Applications (VAA)
In order to get a better understanding and interpretation of the findings presented in
this seminar thesis, it is important to review some core concepts in the field of VAAs
before moving to the analysis and results. In the next two chapters, there is a briefly
overview of VAAs, taxonomy, characteristics and description of the algorithms
considered in the analysis.
2.1 Basic Definition
A Voting Advice Application can be defined as a System that provides the voter (user)
with information about a political Candidate or candidate. The aim is to find the
Candidate or candidate that is nearest to the voter’s political orientation. In order to
start the process of recommendation, the voter typically fills out a questionnaire with a
certain number of political issues. The results of this step is a created user profile. The
questionnaire itself has also been filled out by parties / candidates or if not the case,
the answers were provided by experts. In a second step, the VAA compares the
profiles generated for the user and the parties/candidate, tests their congruence and
serves the user with a ranking of those parties/candidates which are closest to his
political ideologies [1].
When it comes to the design of Voting Advice Applications, Mendez [2] distinguishes
between two main categories of preference matching techniques. The first category
includes High dimensional models whereas the second deals with Low dimensional
models.
2.1 High dimensional models
Many VAA’s are constructed out of a collection of issue policy statement. On average,
there is a number of 30 statements included in the VAA. In this case, the policy space
is high dimensional. For high dimensional matching, most VAA designers choose a
proximity model. The most commonly used metrics of the proximity model are
Euclidean Distance and the City Block metric. What matters most is the distance
between policy alternatives. In addition to the proximity model, a directional model can
be used for issue-voting. The aim of this model is to rather be on the “correct side” of
an argument [2]. The metric behind the directional model is mainly a Scalar Product
which first came up in 1989 by Rabinowitz and Macdonald [3].
4
2.2 Low dimensional models
In analogy to high dimensional matching, low dimensional models use the same logic
where the voter has a preference for his closest Candidate/candidates. It is crucial to
know, that the difference to high dimensional models is not related to the amount of
issue questions considered in the VAA. The difference lies in the dimensionality of the
political space. Typically, a solution space of 2 dimensions is used where the first
dimension may represent the political ideology (social liberalism vs. social
conservatism) and the second dimension stands for economic orientation (economic
left vs. economic right) Low dimensional models illustrate the political space of most
Western Democracies [2].
3. Voting Advice Application Algorithms
It is essential to mention that the number of VAAs that are available for voting
applications remain private and in most of similar research papers only present the
results rather than a mathematical description of the algorithms.
Nevertheless, there exist some algorithms that are the basis to build more complex
ones. In this section for the purposes of our research, we have selected some base
algorithms that would be applied to the data sets in order to get some interpretation of
the results.
3.1 Types of different Algorithms
3.1.1 Euclidean distance
The simplest approach to measure similarity is the Euclidean distance, where d(x,y) is
the degree of the distance:
Where n is the number of dimensions (attributes) and 𝑥𝑘 and 𝑦𝑘 are the kth attributes
(components) of data objects x and y, respectively [6].
3.1.2 Fuzzy C-means Algorithm
The Fuzzy C-means clustering algorithm is based on the minimization of an objective
function called C-means functional. It is defined by Dunn as:
(1.1)
5
Where:
is a vector of cluster prototypes (centers), which have to be determined, and
is a squared inner-product distance norm.
Statistically, (1.2) can be seen as a measure of the total variance of xk from vi. The
minimization of the c-means functional (1.2) represents a nonlinear optimization
problem that can be solved by using a variety of available methods, ranging from
grouped coordinate minimization, over simulated annealing to genetic algorithms. The
most popular method, however, is a simple Picard iteration through the first-order
conditions for stationary points of (1.2), known as the fuzzy c-means (FCM) algorithm.
The stationary points of the objective function (1.2) can be found by adjoining the
constraint (1.5) to J by means of Lagrange multipliers (1.6):
and by setting the gradients of (𝐽)̅ with respect to U, V and λ to zero. If D2ikA > 0, ∀i, k
and m > 1, then (U, V) ∈ Mfc × Rnxc may minimize (1.2) only if
And
(1.2)
(1.3)
(1.4)
(1.5)
(1.6)
(1.7)
(1.8)
6
Note that equation (1.8) gives vi as the weighted mean of the data items that belong to
a cluster, where the weights are the membership degrees. That is why the algorithm is
called "c-means". One can see that the FCM algorithm is a simple iteration through
(1.7) and (1.8).
The FCM algorithm computes with the standard Euclidean distance norm, which
induces hyper spherical clusters. Hence it can only detect clusters with the same shape
and orientation, because the common choice of norm inducing matrix is: A = I or it can
be chosen as an n x n diagonal matrix that accounts for different variances in the
directions in the directions of the coordinate axes of X:
or A can be defined as the inverse of the n x n covariance matrix: A = F-1, with
Here �̅� denotes the sample mean of the data. In this case, A induces the Mahalanobis
norm on Rn.
(1.9)
(1.10)
7
ALGORITHM STEPS:
Given the data set X, choose the number of clusters 1 < c < N, the weighting exponent
m > 1, the termination tolerance ε > 0 and the norm-inducing matrix A. Initialize the
partition matrix randomly, such that U(0) ϵ Mfc.
Repeat for l = 1,2,…
Step 1 Compute the cluster prototypes (means):
Step 2 Compute the Distances:
Step 3 Update the partition matrix:
(1.11)
(1.12)
(1.13)
8
4. Evaluation of VAA Algorithms
After a short overview of VAA basics in the previous sections, the focus of this chapter
lies in the implementation of two algorithms and the analysis of results of the used data
sets.
4.1 Datasets for the cases
The dataset of the voters’ answers to the 30 issue questions for Peruvian presidential
elections 2011 is provided by preferencematcher.org. The dataset for the candidates’
answers to the same issue questions was delivered by Dr. Fernando Mendez. The
Peruvian dataset for the candidates was filled in and scored by the candidates
themselves. In this subsection, we describe in details the content of the data sets.
4.1.1 Peru Presidential voters’ answers
The Peruvian Dataset had already been cleaned when delivered. The clean dataset
does not contain rogue data anymore. A rogue can be the case when a user answers
the issue statements in such a quick way (i.e. to test the application), that one must
assume, that he wasn’t reading them.
The clean dataset contains 40627 users who answered the following 30 issue
questions.
q1 The Peruvian state, rather than the public sector, should be the
owner of the most important businesses and industries of the
country.
q2 The market can resolve the problems in our society because it
distributes resources in a more efficient manner than the state.
q3 The government should limit, by law, interest rates charged by
banks.
q4 The government should control the prices of essential goods.
q5 It should be easier for companies to hire and fire employees.
q6 To keep unemployment rates low it would be acceptable to have a
higher rate of inflation.
q7 To balance the budget it is better to raise taxes than to cut
spending.
q8 The Peruvian government must honor the terms of the contracts on
which foreign companies have invested in Peru.
q9 It is more important to encourage economic growth than to protect
the environment
q10 It is better to finance road construction by private investment than
through taxes levied on all taxpayers
q11 Do you agree with a windfall tax on mining?
q12 After the reduction of IGV (general sales tax) from 19% to 18%, do
you think that IGV should be reduced still further?
q13 The government should spend more on public health services, even if
this may involve raising taxes.
q14 The government should spend more on public education, even if this
may involve raising taxes.
q15 Do you agree that teachers' salaries should be increased
unconditionally?
q16 Camisea gas should cover domestic consumption before being exported.
9
q17 The Free Trade Agreement with the United States should be
renegotiated.
q18 Peru should make more effort towards integration with neighboring
countries than in relations with the United States and Europe.
q19 Peru should introduce the death penalty for the rape of minors.
q20 The consumption of marijuana should be decriminalized in Peru.
q21 Homosexual couples should have the right to establish civil
partnerships.
q22 Abortion in the early months of pregnancy should be decriminalized.
q23 Should Compulsory Military Service be re-introduced?
q24 Do you agree that the budget for the defense sector should be
increased?
q25 A strict public security policy should be established, even if it
violates the human rights of offenders.
q26 The state child care program (Wawa Wasi) should be expanded.
q27 Do you agree that the salaries of senior public officials should be
increased?
q28 Should compulsory voting be maintained?
q29 Should the Congress once again have two chambers: Deputies and
senators?
q30 Parliamentary immunity should be abolished.
Figure 1: Peru Presidential 2011, Issue Questions
The questionnaire had the following answer categories:
-1 = “Completely Disagree”, -0.5 = “Disagree”, 2 = “Neither agree nor disagree”, 3 = “Agree”, 4 =
“Completely Disagree”, 99 = “No opinion”
4.1.2 Peru Presidential candidates’ answers
The dataset contains the answers of the top 5 candidates. The 30 issue questions are
the same as for the users.
The questionnaire had the following answer categories:
-1 = “Completely Disagree”, -0.5 = “Disagree”, 2 =” Neither agree nor disagree”, 3 = “Agree”, 4 =
“Completely Disagree”, 99 = “No opinion”
10
4.2 Datasets Pre-Processing
As we can observe in both datasets, the information per voter is very specific, vast and
the information can be represented as main or RAW vector. In addition, for the
purposes in our seminar study, we will define two types of vectors:
Voters’ vector: Vector that contains the most representative aspects of a voter.
Candidates’ vector: Vector that represents the characteristics of the political
candidate / party.
Both types of vectors can be illustrated in the following model [9]:
Where u (𝑖, 𝑘) and p (𝑗, 𝑘)(I,k) are the answers of the i-th user and j-th candidate
4.2.1 Voter’s Vector
The voter’s RAW vector in a VAA dataset normally contains a certain number of issue
questions (on average 30) and some additional questions (Superior Questions) which
contain demographic information, voting intention plus a self-placement for the voter’s
political orientation. The supplementary questions cannot be compared against a
candidate’s vector. Therefore, it is necessary to make a feature extraction in order to
create a voter’s vector that not only has less characteristics but also represents
properly the voter’s preferences (See Figure 2). The voter’s vector will be then the
vector of all covered issue questions.
RAW Vector from Dataset
Voter’s Vector
Size 30
Feature Extraction
Figure 2: Extracting 30 dimensions (questions) from the Raw Dataset
(1.14)
11
In addition to the 30 issue statements, we further consider Superior Question 5 of the
original dataset in our thesis. Superior Question 5 is the vote intention of the users for
the presidential candidates.
The coding for Superior Question 5 is given in the following:
1 = “Alejandro Toledo”, 2 = “Keiko Fujimori”, 3 = “Luis Castañeda Lossio”, 4 = “Pedro Pablo Kuczynski”,
5 = “Ollanta Humala”, 6 = “Other”, 7 = “None”, 98 = “Did not supply information”
Figure 3 shows the vote intention of the users in the original dataset in absolute values
and in percentages. The candidates will be introduced in section 4.2.2.
Vote intention
Frequency Percent Valid Percent
Cumulative
Percent
Valid Alejandro Toledo 5777 14,2 16,1 16,1
Keiko Fujimori 1535 3,8 4,3 20,4
Luis Castañeda Lossio 1884 4,6 5,3 25,7
Pedro Pablo Kuczynski 20397 50,2 57,0 82,7
Ollanta Humala 2450 6,0 6,8 89,5
Other 784 1,9 2,2 91,7
None 2963 7,3 8,3 100,0
Total 35790 88,1 100,0
Missing Did not supply information 4837 11,9
Total 40627 100,0
Figure 3: Vote Intention based on Superior Question 5
12
4.2.2 Candidates’ Vector
Based on the dataset structure analysis, this seminar thesis only contains the
candidates’ vectors for the top five candidates respectively the five strongest political
party alliances. Figure 4 gives an overview of the candidates and their political party
alliances. It is important to mention that party’s alliances make difficult to represent a
concrete ideology or political position, that is why we rather rely on a candidate’s
analysis that a party analysis. The 5 candidates’ vectors for the 30 issue questions can
be found in Appendix I.
Political Party or Group Presidential candidate
Alianza Gana Perú
Peruvian Nationalist Candidate (Partido Nacionalista Peruano)
Socialist Candidate (Partido Socialista)
Peruvian Communist Candidate (Partido Comunista Peruano)
Revolutionary Socialist Candidate (Partido Socialista Revolucionario)
Political Movement Socialist Voice (Movimiento Político Voz Socialista)
Ollanta Humala
Fuerza 2011
Force 2011 (Fuerza 2011)
National Renewal (Renovación Nacional)
Keiko Fujimori
Alianza Perú Posible
Possible Peru (Perú Posible)
We Are Peru (Somos Perú)
Popular Action (Acción Popular)
Alejandro Toledo
Alianza por el Gran Cambio
Alliance for Progress (Alianza para el Progreso)
Humanist Candidate (Partido Humanista)
Christian People's Candidate (Partido Popular Cristiano)
National Restoration (Restauración Nacional)
Pedro Pablo Kuczynski
Alianza Solidaridad Nacional
Change 90 (Cambio 90)
National Solidarity (Solidaridad Nacional)
Always Together (Siempre Unidos)
Union for Peru (Unión por el Perú)
Luís Castañeda Lossio
Figure 4: Peru Presidential candidates and their respective Party groups
13
4.2.3 Principal Component Analysis
One statistical procedure that helps us to perform later the algorithms evaluation is the
principal component analysis (PCA) involves a mathematical procedure that
transforms a number of (possibly) correlated variables into a (smaller) number of
uncorrelated variables called principal components. The principal component accounts
for as much of the variability in the data as possible, and each succeeding component
accounts for as much of the remaining variability as possible. The main objectives of
PCA are:
1. Identify new meaningful underlying variables.
2. Discover or to reduce the dimensionality of the data set.
The mathematical background lies in "Eigen analysis": The eigenvector associated
with the largest eigenvalue has the same direction as the first principal component.
The eigenvector associated with the second largest eigenvalue determines the
direction of the second principal component.
In this seminar paper, we used the second objective, in that case the covariance matrix
of the data set (also called the "data dispersion matrix") is defined as follows:
Where , the mean of the data (N equals the number of objects in the data set).
Principal Component Analysis (PCA) is based on the projection of correlated high-
dimensional data onto a hyperplane. This mapping uses only the first few q nonzero
eigenvalues and the corresponding eigenvectors of the ,covariance
matrix, decomposed to the matrix that includes the eigenvalues of in its
diagonal in decreasing order, and to the matrix that includes the eigenvectors
corresponding to the eigenvalues in its columns. The vector
is a q-dimensional reduced representation of the observed vector xk, where the Wi
weight matrix contains the q principal orthonormal axes in its column .
(1.15)
14
4.2.4 Cleaning of the original Dataset
In order to simplify our analysis and to illustrate the population of users, the original
dataset [30 issue-questions, 40627 users] had to be reduced. The reduction was done
in 4 steps. Figure 5 illustrates Steps 1 to Step 4 of the cleaning process.
RAW DATA MATRIX
Siz
e 3
0
40627
RAW DATA MATRIX
Siz
e 2
8
2500
RAW DATA MATRIX
Siz
e 3
0
32717
RAW DATA MATRIX
Siz
e 3
0
26149
1 2
RAW DATA MATRIX
Siz
e 3
0
2500
3
4
Figure 5: Cleaning of the Dataset
Step 1: It is important to mention that we only consider users which fully
answered all the 30 issue statements, i.e. any user that had at least one “99”
value was exclude from the analysis. The reason for that is that the Research
model in this seminar thesis is designed to only use fully answered
questionnaires. By cleaning the dataset for all “99” values, the dataset was
reduced from 40627 users (original size) to a size of 32727 users.
Step 2: The aim of this step was to downsize the amount of 32727 users to a
new amount of users that gave a clear statement about their vote intention.
Users which answered “Other”, “None” or “Did not supply information” were
excluded from the dataset. The cleaning done in of step 2 resulted in 26149
users.
Step 3: In this step, the new amount of 26149 users had to be downsized to a
smaller, reasonable quantity of users that represent the whole population. We
decided to reduce the dataset to 2500 users of which each of the five
presidential candidates has the same amount of 500 voters. We had to consider
an equal distribution of number of users per candidate as otherwise fairness is
15
not given and the later Fuzzy C-mean Algorithm would give preference to a
weight according to the amount of users per candidate.
Step 4: The last step was dealing with the reduction of the dimensionality of the
dataset. Originally, the candidates’ vector for the five presidential candidates
contained 30 issue questions. To get a clearer image of the political landscape,
only fully answered issue questions by the candidates can be taken into account
for later analysis. As 2 of the 30 issue statements (question 18 and questions
30) were not fully answered by the candidates, we had to reduce the candidates’
vector from 30 to 28 dimensions. This simple procedure of omitting 2 questions
is called “Feature Extraction” and is not to be confused with Principal
Component Analysis, where the reduced dimensions have a different value
range from the original vector.
The final RAW Data Matrix has the dimensions: [28 questions, 2500 users].
4.2.5 Vote Intention
This section shows the Vote Intention according to Superior Question 5 of the reduced
dataset. Figure 6 shows the Plot of the Vote Intention in 2 dimensions and Figure 7 in
the 3 dimensional space. Both times, the dimensions were reduced by Principal
Component Analysis to 2 respectively 3 dimensions.
Figure 6: PCA Dimensionality Reduction in 2D
16
Figure 7: PCA Dimensionality Reduction in 3D
As there is observed in Figures 6 and 7, the voter’s intention is mixed and it is not
clear to see in 2D or 3D a clear ideology or political position for voters. Fortunately,
during the evaluation, the VAA algorithms will allow to understand better the voter’s
answers and their relations.
4.3 Fuzzy C-means Algorithm
Our analysis contains three different FCM approaches. While the first version
(Standard FCM algorithm) does not allow to enter initial centers, versions 2 and 3 have
the ability to customize initial centers as part of the algorithm. They are modified
versions of the FCM algorithm. For reasons of simplicity, we use the abbreviation FCM
from this point on.
4.3.1 FCM (Random Approach)
As mentioned above, the first version of the FCM algorithm does not use initial centers.
The final centers of the candidates are then calculated randomly.
4.3.2 FCM (Candidates Approach)
The FCM Candidates Approach (see Figure 8) uses the candidates’ vector as an initial
value for cluster centers. The algorithm computes the final candidates’ centers based
17
on the input of initial candidates’ centers. The advantage of this approach is that the
candidates’ positions can be directly integrated into the algorithm. However, we believe
that the candidates’ vector is not a reliable measure to represent the position of the
party as there is the possibility that the candidate can manipulate the answers in his
favor.
Toledo’s Vector
Fujimori’s Vector
Castañeda’s Vector
Humala’s Vector
Size 28
Kuczynski’s Vector
Fuzzy C Means Alg.Initial C
an
did
ate Cen
ters
Inputs Final Centers
RAW DATA MATRIX
Siz
e 2
8
2500
Input
Figure 8: Process to compute the Final Center based on candidates
4.3.3 FCM (Mean Voter Approach)
The FCM Mean Voter Approach (see Figure 9) is only based on the voters’ dataset.
As this approach does not consider candidates’ vectors, one must create a vector that
represents the position of each candidate. This simple method calculates a Mean Voter
for each of the 5 candidates. The following definition shows the average voter of a
Candidate [9].
Where pj is the average voter of a political party or candidate, Nj are the total number
of voters of political party or candidate j, and u(i,k) the answers of the i-th user.
(1.16)
18
Once the 5 Mean Voters’ vectors are created, they can be integrated into the algorithm
as Initial Users’ Centers. The algorithm then runs and delivers the Final Users’ Centers.
The disadvantage of this method is, that the vectors of the users have to be constructed
first from the dataset before to run the algorithm. However, the method has the big
advantage that it takes the information from the whole population of users. As the user
is believed to take the survey seriously and being honest when filling in the
questionnaire, the result of Final Users’ Centers will show a more accurate image of
the candidates’ positions.
Toledo User’s Vector
Fujimori Users’s Vector
Castañeda Users’s Vector
Humala Users’s Vector
Size 28
Kuczynski Users’s Vector
Fuzzy C Means Alg.
Initial C
andid
ate Cente
rs Base
d o
n U
sers
Input Final Centers
RAW DATA MATRIX
Siz
e 2
8
2500
Input
Figure 9: Process to compute the Final Centers of the candidates based on users
19
4.3.4 The FCM Process
Figure 10 shows the Process for the two modified versions of the FCM algorithm from
the Raw Data Matrix until the Results (Final Fuzzy Cluster Centers) of the FCM
Algorithms.
Dimensionality
Reduction
RAW DATA MATRIX
Siz
e 2
8
2505
Fuzzy C Means Alg.PCA
RAW DATA MATRIX
Siz
e 2
2505
FINAL FUZZY
CLUSTER
CENTERS
Figure 10: The FCM Process
The base for the process is the Raw Data Matrix [28x2500] to which in the case of the
FCM Candidates Approach, the vectors of the 5 Candidates’ Initial Centers have been
added in the last five columns. In the case of the FCM Mean User Approach, the 5
Mean Users’ Initial Centers have been added at the same position of the Raw Data
Matrix for both approaches, the initial matrix now has the new dimensions [28x2505].
As a next step, the Principal Component Analysis over both matrices in order to be
able to visualize the users in a two-dimensional space. The new matrices now are
reduced from 28 to 2 dimensions. The next step run the FCM algorithm over the 2
matrices and supplied the Final Candidates’ Centers respectively the Final Users’
Centers. The results for the both Approaches will be shown and discussed in sections
4.4.3 and 4.4.4
20
4.4 VAA Algorithms Results and Analysis
In the next section, we present the results of the Euclidean Distance and the 3 versions
of the FCM algorithm with respect to their outcomes and analysis.
4.4.1 Euclidean Distance
The results of the Euclidean distance are based on the original dataset with 40627
users and their respective answers to the 30 issue questions. Also the original
candidates’ vectors were used for the computation. Figure 11 contains the process of
the Eulidean Distance.
Voter’s Vector
Voter’s Vector
Voter’s Vector
Voter’s Vector
Size 30
All
Vote
rs
Euclidean Alg.
Calculate the distance between
each voter’s vector and the
candidate’s vector.
To
led
o’s
Vecto
r
Fu
jimo
ri’s
Ve
cto
r
Ca
stañ
ed
a’s
Vecto
r
Hum
ala
’s V
ect
or
Ku
czynsk
i’s
Vecto
r
Siz
e 3
0
Choose Min
Distance and
Classify
Voters Voters
Voters Voters
Voters
Classify Voters
Figure 11: The Process of the Euclidean Distance
21
The results of the Euclidean Distance are shown in Figure 12.
Figure 12: Euclidean Distance Results
Figure 13: Bar Chart of the Vote Intention
22
A simple observation based on Figure 12 is that according to their way of thinking,
most users are similar to Humala, followed by Castañeda and Kuczynski. When we
look at the results of the Peruvian Presidential Elections in 2011, where Humala was
the winner, the result of the Euclidean Distance for Humala is consistent with his
electoral victory. At this point, an interesting idea is to compare the results of the
Euclidean Distance with the Vote Intention of the users (Superior Question 5). Figure
13 shows the bar chart of the Vote Intention. An eye-catching discovery is that
according to the Vote intention, most users favor Kuczynski, followed by Toledo and
Humala. The comparison of the two figures shows that the own perceptions of the
users vary greatly from their real way of thinking. These differences show that even
after filling out the questionnaire, the voters do not have found a good candidate to
vote for or they vote for a candidate’s popularity even if his statements do not match
with the voter’s statements. This fact leads to the conclusion that the results of the
Euclidean distance are not precise enough to provide the voter with a reliable
recommendation. For this reason, the next session (4.4.2) goes deeper into the
analysis of the Fuzzy C-means algorithm.
4.4.2 FCM (Random Approach)
Figure 14 and Figure 15 show the results of the Random FCM Clustering Algorithm
respectively the voters classified according to the Random Final Centers. The main
disadvantage of this method is that we do not have knowledge about the initial points.
This leads to the fact that every time the algorithm is running again, the Final Centers
will end up in a different position. As a consequence, the clustering of the users will
change as well. Therefore, the random approach is very low in terms of prediction
accuracy and should not be considered for the construction of a later Fuzzy
Recommender System.
23
Figure 14: Random Final Centers
Figure 15: Voters classified by Random Final Centers
24
4.4.3 FCM (Candidates Approach)
Figure 16 and Figure 17 shows the results of the Random FCM Clustering Algorithm
with the Initial Candidates’ Centers and Final Candidates Centers.
Figure 16: Final Candidates’ Centers
Figure 17: Voters classified by Candidates' Final Centers
25
In Figure 16, one can observe that the Initial Candidate’s Centers are located to the
very left of the graph and do not find themselves inside the scatterplot of the 2500
users. Furthermore, they are located far away from the Candidates Final centers
produced by the algorithm. This observation brings us back to section 4.3.2 to the
disadvantage of the Candidates’ Approach. The location of the Initial Centers shows
strong evidence for our apprehension that the candidate’s vector is not a reliable size
to represent the real position of the party as the candidate can manipulate the answers
in this favor for various reasons. Figure 17 contains the assignment of the users to
their particular candidate. The clustering performs very well as one can notice that
each Final Center lies pretty much in the center of the particular cluster of users.
4.4.4 FCM (Mean User Approach)
Figure 18 and Figure 19 show the results of the FCM Clustering Algorithm respectively
the voters classified by the Mean Users’ Final Centers.
Figure 18: Final Users' Centers
26
Figure 19: Voters classified by Mean User Final Centers
In Figure 18, it is clearly visible that the Initial Users’ Centers are all located close to
the Final Users’ Centers. The close proximity of them underlies our guess from section
4.3.3 about the high prediction accuracy of the FCM Mean User Approach. Therefore,
the User’s vector implies the best representation of a candidate and his particular party
groups. Similar to the case of the Candidates’ Approach, the clustering of the users in
Figure 19 performs well and the Final Users Centers are situated well in the respective
cluster.
As a summary, the Mean User Approach is the only one out of the 3 considered FCM
algorithms that has a high ability to prediction. To its performance comes the big
advantage that it does not require to have a candidate’s vector. It can be applied to
any dataset of users for any kind of election. However, in order to be able to perform
the algorithm, the user dataset must not only contain the answers to the issue
questions, but also a Superior Question which can be the vote intention, or the
preference for a candidate or a political party. Elsewise, the Mean User vector cannot
be computed.
27
4.4.5 Final Candidates Distances Matrix
After the application of the 3 versions of the FCM algorithm, this section presents the
Candidates Distances Matrix according to their Final Centers.
Figure 20: Final Candidates Distances Matrix
The red values in the Matrix indicate that two candidates are very close from each
other according to the results of the performed FCM algorithms. For instance, the
similarity score between Toledo computed by candidates and Castañeda computed by
users is 1.04E-0.4, located in the blue framed box in Figure 20. Looking at Toledo, we
discover that Toledo based on the answers of his questionnaire, is perceived as
Castañeda by the population. This fact shows that the perception of the people is that
Toledo behaves in a similar way similar as Castañeda in his way to answer. The same
kind of observations can be made for (Castañeda Candidates vs. Toledo Users) with
score 3,11E-0.5, (Fujimori Candidates vs. Kuczynski Users) with score 1.60E-0.5,
(Kuczynski Candidates vs. Humala Users) with score 8.93E-0.5 and (Humala
Candidates vs. and Fujimori Users with score 7.30E-0.5.
4.5 Degrees of Membership
4.5.1 Candidates Approach
After the FCM algorithm has been performed, one major goal of this seminar thesis is
to illustrate the degrees of membership to the five candidates for every user considered
in the dataset. The presentation of the degrees of membership will be the fundamental
result of any future Fuzzy based Recommender System for the user who wants to
know to which degree he belongs to the way of thinking of particular candidate /party.
For our seminar case, we decided to pick User 3. His degrees of Membership by
Candidates are shown below in Figure 21.
Toledo Fujimori Castañeda Kuczynski Humala Toledo Fujimori Castañeda Kuczynski Humala Toledo Fujimori Castañeda Kuczynski Humala
Toledo 0 0,0156 0,0412 0,0201 0,0281 0,0281 0,0156 0,0201 4,20E-08 0,0412 0,02 0,0413 0,0282 0,0156 8,93E-05
Fujimori 0,0156 0 0,0258 0,0356 0,0129 0,0129 3,72E-08 0,0356 0,0156 0,0258 0,0355 0,0258 0,0129 1,60E-05 0,0155
Castañeda 0,0412 0,0258 0 0,0613 0,0133 0,0133 0,0258 0,0613 0,0412 1,05E-08 0,0613 7,30E-05 0,0132 0,0257 0,0412
Kuczynski 0,0201 0,0356 0,0613 0 0,0482 0,0482 0,0356 1,89E-08 0,0201 0,0613 3,10E-05 0,0614 0,0482 0,0356 0,0201
Humala 0,0281 0,0129 0,0133 0,0482 0 2,33E-08 0,0129 0,0482 0,0281 0,0133 0,0482 0,0134 1,03E-04 0,0129 0,028
Toledo 0,0281 0,0129 0,0133 0,0482 2,33E-08 0 0,0129 0,0482 0,0281 0,0133 0,0482 0,0134 1,03E-04 0,0129 0,028
Fujimori 0,0156 3,72E-08 0,0258 0,0356 0,0129 0,0129 0 0,0356 0,0156 0,0258 0,0355 0,0258 0,0129 1,60E-05 0,0155
Castañeda 0,0201 0,0356 0,0613 1,89E-08 0,0482 0,0482 0,0356 0 0,0201 0,0613 3,11E-05 0,0614 0,0482 0,0356 0,0201
Kuczynski 4,20E-08 0,0156 0,0412 0,0201 0,0281 0,0281 0,0156 0,0201 0 0,0412 0,02 0,0413 0,0282 0,0156 8,93E-05
Humala 0,0412 0,0258 1,05E-08 0,0613 0,0133 0,0133 0,0258 0,0613 0,0412 0 0,0613 7,30E-05 0,0132 0,0257 0,0412
Toledo 0,02 0,0355 0,0613 3,10E-05 0,0482 0,0482 0,0355 3,11E-05 0,02 0,0613 0 0,0613 0,0482 0,0355 0,0201
Fujimori 0,0413 0,0258 7,30E-05 0,0614 0,0134 0,0134 0,0258 0,0614 0,0413 7,30E-05 0,0613 0 0,0133 0,0258 0,0413
Castañeda 0,0282 0,0129 0,0132 0,0482 1,03E-04 1,03E-04 0,0129 0,0482 0,0282 0,0132 0,0482 0,0133 0 0,0129 0,0281
Kuczynski 0,0156 1,60E-05 0,0257 0,0356 0,0129 0,0129 1,60E-05 0,0356 0,0156 0,0257 0,0355 0,0258 0,0129 0 0,0155
Humala 8,93E-05 0,0155 0,0412 0,0201 0,028 0,028 0,0155 0,0201 8,94E-05 0,0412 0,0201 0,0413 0,0281 0,0155 0
Candidates UsersFinal Centers
Ran
dom
Can
did
ate
sU
sers
Random
28
Figure 21: Degrees of Membership by Candidates (User 3)
4.5.2 Mean User Approach
Similar to the prior approach, Figure 22 displays the degrees of membership by Mean
Users again for User 3.
Figure 22: Degrees of Membership by Mean Users (User 3)
29
5. Recommendations
The goal of this section is to provide recommendations for VAA Designers who aim at
integrating a Fuzzy based logic in their applications. All recommendations are based
on observations which have been made during the work on this seminar thesis.
The key factor for a successful data analysis is a clean dataset which has been
swept for “rogue data”. Only then, one can conduct a serious analysis. If the
dataset has not yet been cleaned for rouges, consider a Weeding Out Rogues
Technique.
In our clean dataset, we encountered a huge amount of users that had a 99 =
”No opinion” for a particular answer to a question in their vector. We highly
recommend to rather eliminate the questions than keep them for the purpose of
not having a distorted image of the population.
To increase the accuracy of the Recommendation, voters and candidates
should be motivated to provide full answers to the issue questions.
Fairness (Equal distribution) of the users with their particular vote intention was
an important perquisite to run the Fuzzy C-means algorithm properly. Select the
same number of people. If users are taken in the same proportional amount per
candidate which leads to a preference weight by the algorithm), fairness is not
guaranteed anymore.
For the purpose of visualization of the data, consider to work in a low
dimensional space with two- or three-dimensions. The Principal Component
Analysis can be easily implemented and is a useful tool for dimensionality
reduction.
We do not recommend to use the FCM Random Approach because the Initial
Centers cannot be entered and the Final Centers will be in a different location,
every time the algorithm is performing again.
When choosing the Fuzzy C-means algorithm, the Mean User Approach is
highly recommended to be taken into consideration as it serves the most
accurate results. This approach shows the reality what voters want and need.
The Mean User Approach also has benefits regarding the candidates’
perspective as they can better target and localize the people which are similar
to themselves. If the Candidates Initial Centers are available, the Candidates
Approach can be considered for the purpose of comparison when creating the
30
clusters. For instance, there is the possibility that the both approaches can
deliver more similar results in another dataset.
Once, the different FCM algorithms have been run, researches can profit from
the Final Centers Distance Matrix in order to identify differences in the
perceptions of the candidate by the users and himself. Those differences were
not further examined but can be an interesting point to continue discussion.
6. Conclusion
The intensive work on the case study of Peruvian Presidential elections 2011 gave us
a comprehensive insight into the field of Voting Advice Applications. After having
collected the theoretical framework in prior case studies and literature, the pre-
processing of the original dataset turned out to be a first major challenge. As there is
no general procedure how to clean a dataset, our task was to try different methods and
to evaluate which way of reducing the dataset would be most appropriate. We finally
created a dataset which in our opinion was ideal to apply the different FCM algorithms.
In our practical part, we first applied the Euclidean Distance which is the one that was
used in most of prior research case studies. However, until now, only few authors have
ever applied a Fuzzy based Approach to Voting Advice Applications. This existing gap
was for us the ideal point to bind on and try something new – which leaded us to
introduce the Fuzzy C-Mean Algorithm and apply to the case of Peruvian Presidential
Elections.
The Introduction of the FCM algorithm was the base for the analysis and discussion of
the results of this seminar thesis. The results had a wide range starting from applying
the Euclidean Distance and comparing it with the Vote intention, over the comparison
and identification of differences in terms of prediction accuracy of the 3 tested FCM
algorithms, over the observations made in the Final Candidates Distance Matrix, until
the visualization of Degrees of Membership for a particular user.
We see our contribution in the application of the Fuzzy C-Means algorithm with its
corresponding visualization of the results and the proposition of a Fuzzy
Recommender System in Chapter 7. This seminar thesis provides an ideal base for
future research in terms of Fuzzy Recommender Systems and may motivate to conduct
further inquiries in the field of Voting Advice Applications.
31
7. Future Work
Based on previous results and recommendations, we consider that the algorithm has
shown good results and it has potential to be tested in another similar case, in the next
potential elections in Peru or other country, where either expert or candidates’ vectors
are available. Finally, we propose as a future work to go further and not only test what
we have done in this seminar but also the creation of a recommender system that
considers our Fuzzy C-Means Algorithm. Thereby in this section we mainly focus to
describe the proposed Recommender Architecture that can be applied in future
analysis and cases. The proposed architecture is shown in Figure 23.
QUESTIONS
Q1
Q2
Q3
User
Q4
Q28
Correct
Dimensions
Verification
Verify the number of
answers in order to
select dimensionality
and create a user
vector.
RAW DATA MATRIX
Rezi
se i
f
need
ed
2500
To
led
o’s
Vecto
r
Fu
jimo
ri’s
Ve
cto
r
Ca
stañ
ed
a’s
Vecto
r
Hum
ala
’s V
ect
or
Ku
czynsk
i’s
Vecto
r
Rezi
se i
f n
eed
ed
Merge Matrices
RAW DATA MATRIX
Fin
al Siz
e
2505
PCA
Inp
ut
Fuzzy C Means Alg.FINAL FUZZY
CLUSTER
CENTERS
RAW DATA MATRIX
Siz
e 2
2505
Use
r V
ecto
r
Fuzzy Rules
Final
Recommendation
(a)
(b)(c)
(d)
(e)(f)(g)
(h)
(i) (j)
Figure 23: Proposed Fuzzy Recommender Algorithm for VAAs
Based on Figure 23, we can observe different steps that are explained in the next
points:
32
a) The user who would like to receive a recommendation needs to answer a set of
28 questions or maybe more depending of the dataset obtained. Then
depending of the questions answered a verification process determines which
questions were answered and form a User’s Vector that will be added to the
general dataset.
b) Once the size of the User’s Vector is determined, the candidate and RAW Data
Matrix has to be in the same dimensions of the User’s Vector, therefore as we
mentioned before in the recommendations, the more than the user answers, the
better the recommendation provided.
c) Once the Raw Data and the candidate’s vectors were adapted to the User’s
Vector, both matrices are merged.
d) The new formed matrix is then processed in a dimensionality reduction to N
dimensions to 2 by the PCA algorithm.
e) The output of the PCA is in two dimensions and contains the candidate’s vectors
and the User’s Vector in the Raw Data Matrix.
f) The new 2 dimensional matrix is an input for the Fuzzy C-Means Algorithm.
g) The result is the centers of the candidates that the user has a choice to vote.
h) In this part we determine the level of membership assigned by the algorithm for
the user to each candidate or political party.
i) Once we know the degree of membership from the user, we proceed to apply
some fuzzy rules which can be customized by each context depending of the
country or additional preferences of the user like, if the user belongs 50% to two
candidates, then it will choose the most popular for instance. This fuzzy rules
part is optional but it can be helpful when the membership degree of users who
want a recommendation are very similar.
j) The last part is to provide a final recommendation and in case that the user has
not answered all the questions. The user then is invited to get a new
recommendation, based on more questions.
The key element of the success or high accuracy of this recommender is to find a good
dataset where all the users have answered all the answers and the amount of users
per candidate has the same amount, otherwise one candidate can influence the
distribution of the cluster centers. Finally, for the side of the user, as more information
the user provides, the better the recommendation will be provided.
33
8. References
[1] Teran, Luís. Smart Participation: A Fuzzy-Based Recommender System for Political
Community-Building. Ph.D. Thesis of the University of Fribourg, Department of
Informatics, 2014.
[2] Mendez, F. (2014) What’s behind a matching algorithm: A critical assessment of
how Voting Advice Applications produce voting recommendations, in: Marschall, S.,
Garzia, D. (Eds.), Matching Voters with Parties and Candidates (pp. 49-66).
Colchester: ECPR Press.
[3] Rabinowitz, G. and Macdonald, S. E. (1989). A directional theory of issue voting.
The American Political Science Review, 83(1):93{121.
[4] Garzia Diego; Marshall Stefan (eds), Matching voters with parties and candidates:
voting advice applications in a comparative perspective, Colchester: ECPR Press,
2014, pp. 105-114
[5] Garzia, Diego; Marschall Stefan (eds), Matching voters with parties and candidates:
voting advice applications in a comparative perspective, Colchester: ECPR Press,
2014, pp. 1-10
[6] Garzia, Diego; Marschall Stefan. Matching voters with parties and candidates:
Voting Advice Applications in a Comparative Perspective, Colchester; ECPR Press,
2014.
[7] Mendez, F., Gemenis, K., Djouvas, C. (2014) Methodological challenges in the
analysis of Voting Advice Application generated data. Proceedings of the 9th
International Workshop on Semantic and Social Media Adaptation and
Personalization, pp. 142-148
[8] Mendez, Fernando. Matching voters with political parties and candidates: an
empirical test of four algorithms. Int. J. of Electronic Governance, 2012 Vol.5, No.3/4,
pp.264 – 278.
[9] Katakis, I., N. Tsapatsoulis, V. Triga, C. Tziouvas, F. Mendez (2012). Clustering
Online Poll Data: Towards a Voting Assistance System. Proceedings of the 7th
International Workshop on Semantic and Social Media Adaptation and Personalization
(SMAP’12), Luxembourg
34
Appendix I: The 5 Candidates’ Vectors for the 30 Issue questions
-1 = “Completely Disagree”, -0.5 = “Disagree”, 2 =” Neither agree nor disagree”, 3 = “Agree”, 4 =
“Completely Disagree”, 99 = “No opinion”
Questions Toledo Fujimori Castañeda Kuczynski Humala
q1 1,0 1,0 1,0 1,0 -0,5
q2 0,5 -0,5 0,5 0,0 0,5
q3 0,5 0,5 0,5 1,0 -0,5
q4 0,5 0,5 0,5 1,0 -1
q5 0,5 -0,5 0,5 -1 0,5
q6 0,5 0,5 0,5 1,0 0,5
q7 0,5 0,5 0,0 1,0 -0,5
q8 -0,5 -0,5 -0,5 -1 0,5
q9 1,0 0,5 0,5 1,0 1,0
q10 -1 -0,5 0,0 -1 0,5
q11 -0,5 0,5 0,5 1,0 -1
q12 0,5 -1 0,5 -1 -0,5
q13 0,5 0,0 0,0 0,5 -0,5
q14 0,5 0,0 0,0 0,5 -0,5
q15 1,0 0,5 1,0 1,0 -0,5
q16 -1 -0,5 -1 -1 -1
q18 0 99,0 0.5 1 -0,5
q17 0,5 0,5 0,5 0,5 -1
q19 0,5 -1 0,5 1,0 1,0
q20 0,5 1,0 1,0 1,0 1,0
q21 -0,5 0,5 -0,5 -0,5 -0,5
q22 0,0 1,0 1,0 1,0 0,5
q23 0,5 -0,5 0,5 1,0 -0,5
q24 -0,5 -1 -0,5 0,0 -0,5
q25 0,5 0,0 0,5 1,0 0,5
q26 -1 -0,5 -1 -1 -0,5
q27 -0,5 -0,5 -1 -0,5 -0,5
q28 -0,5 0,5 0,0 1,0 -1
q29 -0,5 0,5 -0,5 1,0 -0,5
q30 0,0 99,0 0.5 -0,5 -0,5
35
Appendix II: Matlab Code for the Project
The next code is a new modified version for the Fuzzy Cluster Library in Matlab and
we created a new function based on the original one with some modifications and
suggestions of Dr. Luis Terán Tamayo who provided us a more complex modification
of this library. His support and help was very useful to understand the changes in the
code in order to create our version and allow the user to introduce the initial centers
in the function directly.
MATLAB CODE:
function result = CFCMclust(data,param,centers,mode)
%Authors: Jose A. Mancera, Philipp Bosshard
%University of Fribourg
%email: [email protected]
%email: [email protected]
path(path,'C:\Program Files\MATLAB\R2015b\toolbox\FUZZCLUST')
%% data normalization
data_bi_dim = data.X;
numb_centers=param.c;
%checking the parameters given
%default parameters
if exist('param.m')==1, m = param.m;else m = 2;end;
if exist('param.e')==1, e = param.m;else e = 1e-4;end;
[users,dimensions] = size(data_bi_dim);
unit_mat_temp = ones(users,1);
%% Initialize fuzzy partition matrix
if strcmp(mode, 'random')==1
mm = mean(data_bi_dim); %mean of the data
aa = max(abs(data_bi_dim - ones(users,1)*mm));
centers_clusters = 2*(ones(numb_centers,1)*aa).*(rand(numb_centers,dimensions)-0.5)
+ ones(numb_centers,1)*mm;
elseif strcmp(mode, 'candidate')==1
centers_clusters=centers;
elseif strcmp(mode, 'mean_voters')==1
centers_clusters=centers;
end
fprintf('The value of the initial centers given by the CFCM (x,y) are: \n')
centers_clusters
fprintf('The number of centers are : \n')
numb_centers
36
for j = 1 : numb_centers,
dist_cent_user = data_bi_dim - unit_mat_temp*centers_clusters(j,:);
d(:,j) = sum((dist_cent_user*eye(dimensions).*dist_cent_user),2);
end;
d = (d+1e-10).^(-1/(m-1));
matrix_centers = (d ./ (sum(d,2)*ones(1,numb_centers)));
f = zeros(users,numb_centers); % partition matrix
iter = 0; % iteration counter
%% Iterate
while max(max(matrix_centers-f)) > e
iter = iter + 1;
f = matrix_centers;
% Calculate centers
fm = f.^m;
sumf = sum(fm);
centers_clusters = (fm'*data_bi_dim)./(sumf'*ones(1,dimensions));
for j = 1 : numb_centers,
dist_cent_user = data_bi_dim - unit_mat_temp*centers_clusters(j,:);
d(:,j) = sum((dist_cent_user*eye(dimensions).*dist_cent_user),2);
end;
distout=sqrt(d);
J(iter) = sum(sum(matrix_centers.*d));
% Update matrix_centers
d = (d+1e-10).^(-1/(m-1));
matrix_centers = (d ./ (sum(d,2)*ones(1,numb_centers)));
end
fm = f.^m;
sumf = sum(fm);
%results
result.data.f = matrix_centers;
result.cluster.v = centers_clusters;
result.iter = iter;