Seminar in Electronic Government - unifr.ch · PDF file3.1.2 Fuzzy C-means Algorithm ... The...

40
Seminar in Electronic Government University of Fribourg, Department of Informatics Case Study Analysis of matching voters’ and candidates’ preferences applying two VAA- matching algorithms: A case study based on Peruvian Presidential Elections 2011. STUDENT NAMES: José A. Mancera, Philipp Bosshard STUDENT NUMBERS: 10-801-207, 06-200-844 COURSE NAME: Electronic Government DEPARTMENT: Department of Informatics SUPERVISOR: ASSISTANT: DATE OF SUBMISSION: Prof. Dr. Andreas Meier Luis Terán 11-29-2015

Transcript of Seminar in Electronic Government - unifr.ch · PDF file3.1.2 Fuzzy C-means Algorithm ... The...

Page 1: Seminar in Electronic Government - unifr.ch · PDF file3.1.2 Fuzzy C-means Algorithm ... The results of this seminar document should provide the parties ... over simulated annealing

Seminar in Electronic Government

University of Fribourg, Department of Informatics

Case Study

Analysis of matching voters’ and candidates’ preferences applying two VAA-matching algorithms: A case study based on Peruvian Presidential Elections 2011.

STUDENT NAMES: José A. Mancera, Philipp Bosshard STUDENT NUMBERS: 10-801-207, 06-200-844 COURSE NAME: Electronic Government

DEPARTMENT: Department of Informatics

SUPERVISOR: ASSISTANT: DATE OF SUBMISSION:

Prof. Dr. Andreas Meier Luis Terán 11-29-2015

Page 2: Seminar in Electronic Government - unifr.ch · PDF file3.1.2 Fuzzy C-means Algorithm ... The results of this seminar document should provide the parties ... over simulated annealing

II

Table of contents

List of Figures ............................................................................................................ IV

1. Introduction ............................................................................................................. 1

1.1 Problem statement ............................................................................................ 1

1.2 Research Objectives and Methodology ............................................................. 1

1.2.1 Research Questions ................................................................................... 1

1.2.2 Objectives and Output of the thesis ............................................................ 1

1.2.3 Research Methodology ............................................................................... 2

1.3 Timetable .......................................................................................................... 2

1.4 Addressees ....................................................................................................... 2

2. Voting Advice Applications (VAA) ........................................................................... 3

2.1 Basic Definition ................................................................................................. 3

2.1 High dimensional models .................................................................................. 3

2.2 Low dimensional models ................................................................................... 4

3. Voting Advice Application Algorithms ..................................................................... 4

3.1 Types of different Algorithms ............................................................................ 4

3.1.1 Euclidean distance ..................................................................................... 4

3.1.2 Fuzzy C-means Algorithm .......................................................................... 4

4. Evaluation of VAA Algorithms ................................................................................. 8

4.1 Datasets for the cases ...................................................................................... 8

4.1.1 Peru Presidential voters’ answers .............................................................. 8

4.1.2 Peru Presidential candidates’ answers ....................................................... 9

4.2 Datasets Pre-Processing ................................................................................ 10

4.2.1 Voter’s Vector ........................................................................................... 10

4.2.2 Candidates’ Vector ................................................................................... 12

4.2.3 Principal Component Analysis .................................................................. 13

4.2.4 Cleaning of the original Dataset ................................................................ 14

4.2.5 Vote Intention ........................................................................................... 15

Page 3: Seminar in Electronic Government - unifr.ch · PDF file3.1.2 Fuzzy C-means Algorithm ... The results of this seminar document should provide the parties ... over simulated annealing

III

4.3 Fuzzy C-means Algorithm ............................................................................... 16

4.3.1 FCM (Random Approach) ........................................................................ 16

4.3.2 FCM (Candidates Approach) .................................................................... 16

4.3.3 FCM (Mean Voter Approach) ................................................................... 17

4.3.4 The FCM Process ..................................................................................... 19

4.4 VAA Algorithms Results and Analysis ............................................................. 20

4.4.1 Euclidean Distance ................................................................................... 20

4.4.2 FCM (Random Approach) ........................................................................ 22

4.4.3 FCM (Candidates Approach) .................................................................... 24

4.4.4 FCM (Mean User Approach)..................................................................... 25

4.4.5 Final Candidates Distances Matrix ........................................................... 27

4.5 Degrees of Membership .................................................................................. 27

4.5.1 Candidates Approach ............................................................................... 27

4.5.2 Mean User Approach ................................................................................ 28

5. Recommendations ................................................................................................ 29

6. Conclusion ............................................................................................................ 30

7. Future Work .......................................................................................................... 31

8. References ........................................................................................................... 33

Appendix I: The 5 Candidates’ Vectors for the 30 Issue questions ........................... 34

Appendix II: Matlab Code for the Project .................................................................. 35

Page 4: Seminar in Electronic Government - unifr.ch · PDF file3.1.2 Fuzzy C-means Algorithm ... The results of this seminar document should provide the parties ... over simulated annealing

IV

List of Figures

Figure 1: Peru Presidential 2011, Issue Questions ..................................................... 9

Figure 2: Extracting 30 dimensions (questions) from the Raw Dataset .................... 10

Figure 3: Vote Intention based on Superior Question 5 ............................................ 11

Figure 4: Peru Presidential candidates and their respective Party groups ............... 12

Figure 5: Cleaning of the Dataset ............................................................................. 14

Figure 6: PCA Dimensionality Reduction in 2D ........................................................ 15

Figure 7: PCA Dimensionality Reduction in 3D ........................................................ 16

Figure 8: Process to compute the Final Center based on candidates ...................... 17

Figure 9: Process to compute the Final Centers of the candidates based on users . 18

Figure 10: The FCM Process .................................................................................... 19

Figure 11: The Process of the Euclidean Distance ................................................... 20

Figure 12: Euclidean Distance Results ..................................................................... 21

Figure 13: Bar Chart of the Vote Intention ................................................................ 21

Figure 14: Random Final Centers ............................................................................. 23

Figure 15: Voters classified by Random Final Centers ............................................. 23

Figure 16: Final Candidates’ Centers ....................................................................... 24

Figure 17: Voters classified by Candidates' Final Centers ........................................ 24

Figure 18: Final Users' Centers ................................................................................ 25

Figure 19: Voters classified by Mean User Final Centers ......................................... 26

Figure 20: Final Candidates Distances Matrix .......................................................... 27

Figure 21: Degrees of Membership by Candidates (User 3)..................................... 28

Figure 22: Degrees of Membership by Mean Users (User 3) ................................... 28

Figure 23: Proposed Fuzzy Recommender Algorithm for VAAs ............................... 31

Page 5: Seminar in Electronic Government - unifr.ch · PDF file3.1.2 Fuzzy C-means Algorithm ... The results of this seminar document should provide the parties ... over simulated annealing

1

1. Introduction

1.1 Problem statement

Overview

The relationship between citizens and voters has always been in constant interaction

since the creation of governments. On the one hand, the identification with a certain

political party or candidate is difficult for the user. On the other hand, the political parties

can easily lose the visibility of the voters’ preferences, so that their strategies do not

reach to satisfy the needs and wants of the society.

The goal of this seminar thesis is to compare two VAA Algorithms and to evaluate their

accuracy. The aim of this comparison is to conduct an analysis in order to find potential

correlations between voter’s preferences and candidates from political parties.

1.2 Research Objectives and Methodology

1.2.1 Research Questions

The next group of questions is the guideline of our study, each of them is answered in

sequence during the evolution of the document.

1. Which kind of VAA matching algorithms exist?

2. Which algorithms fit most to correlate the voter’s preferences with the

candidates’ proposals?

3. What are the main differences in the results of the algorithms in terms of

(prediction) accuracy?

4. What are potential improvements for VAA’s and recommendations in order to

get a more complete analysis?

1.2.2 Objectives and Output of the thesis

The aim of this seminar thesis is to compare two VAA Algorithms by applying them to

a specific dataset and evaluate their accuracy. The test of these algorithms will help to

analyze correlations between voters’ preferences and political parties/candidates’

proposals. The results of the correlation analysis may imply differences among the

particular VAA algorithms.

Page 6: Seminar in Electronic Government - unifr.ch · PDF file3.1.2 Fuzzy C-means Algorithm ... The results of this seminar document should provide the parties ... over simulated annealing

2

1.2.3 Research Methodology

In a first step, selected textbooks, previous research papers and similar cases will be

taken into account in order to get an overview of the theoretical framework. The second

step of this thesis will consider the application, analysis and comparison of two

matching algorithms, based on the dataset of Peruvian Presidential Elections 2011.

The datasets are provided by preferencematcher.org.

1.3 Timetable

10-07-2015 Acceptance of working title

10-11-2015 Submission of the proposal

October 2015 Continue literature Research and

reading.

Writing Chapter 1, 2 and 3

November 2015 Applying the algorithms using the

datasets provided.

Writing Chapter 4, 5, 6 and 7

Draft of the paper

Finishing the report

Revision and Correction

11-02-2015 Midterm Appointment

11-29-2015 Submission of the thesis report

12-04-2015 Presentation of the thesis report

1.4 Addressees

The target audience of thesis is primarily students in the fields of computer science,

marketing, political sciences and professionals who are involved in the field of Voting

Advice Applications. The results of this seminar document should provide the parties

mentioned above not only valuable knowledge in order to better understand, analyze

and improve the quality of Voting Advice Applications, but also a better understanding

of the consequences of the algorithms on voters and parties.

Page 7: Seminar in Electronic Government - unifr.ch · PDF file3.1.2 Fuzzy C-means Algorithm ... The results of this seminar document should provide the parties ... over simulated annealing

3

2. Voting Advice Applications (VAA)

In order to get a better understanding and interpretation of the findings presented in

this seminar thesis, it is important to review some core concepts in the field of VAAs

before moving to the analysis and results. In the next two chapters, there is a briefly

overview of VAAs, taxonomy, characteristics and description of the algorithms

considered in the analysis.

2.1 Basic Definition

A Voting Advice Application can be defined as a System that provides the voter (user)

with information about a political Candidate or candidate. The aim is to find the

Candidate or candidate that is nearest to the voter’s political orientation. In order to

start the process of recommendation, the voter typically fills out a questionnaire with a

certain number of political issues. The results of this step is a created user profile. The

questionnaire itself has also been filled out by parties / candidates or if not the case,

the answers were provided by experts. In a second step, the VAA compares the

profiles generated for the user and the parties/candidate, tests their congruence and

serves the user with a ranking of those parties/candidates which are closest to his

political ideologies [1].

When it comes to the design of Voting Advice Applications, Mendez [2] distinguishes

between two main categories of preference matching techniques. The first category

includes High dimensional models whereas the second deals with Low dimensional

models.

2.1 High dimensional models

Many VAA’s are constructed out of a collection of issue policy statement. On average,

there is a number of 30 statements included in the VAA. In this case, the policy space

is high dimensional. For high dimensional matching, most VAA designers choose a

proximity model. The most commonly used metrics of the proximity model are

Euclidean Distance and the City Block metric. What matters most is the distance

between policy alternatives. In addition to the proximity model, a directional model can

be used for issue-voting. The aim of this model is to rather be on the “correct side” of

an argument [2]. The metric behind the directional model is mainly a Scalar Product

which first came up in 1989 by Rabinowitz and Macdonald [3].

Page 8: Seminar in Electronic Government - unifr.ch · PDF file3.1.2 Fuzzy C-means Algorithm ... The results of this seminar document should provide the parties ... over simulated annealing

4

2.2 Low dimensional models

In analogy to high dimensional matching, low dimensional models use the same logic

where the voter has a preference for his closest Candidate/candidates. It is crucial to

know, that the difference to high dimensional models is not related to the amount of

issue questions considered in the VAA. The difference lies in the dimensionality of the

political space. Typically, a solution space of 2 dimensions is used where the first

dimension may represent the political ideology (social liberalism vs. social

conservatism) and the second dimension stands for economic orientation (economic

left vs. economic right) Low dimensional models illustrate the political space of most

Western Democracies [2].

3. Voting Advice Application Algorithms

It is essential to mention that the number of VAAs that are available for voting

applications remain private and in most of similar research papers only present the

results rather than a mathematical description of the algorithms.

Nevertheless, there exist some algorithms that are the basis to build more complex

ones. In this section for the purposes of our research, we have selected some base

algorithms that would be applied to the data sets in order to get some interpretation of

the results.

3.1 Types of different Algorithms

3.1.1 Euclidean distance

The simplest approach to measure similarity is the Euclidean distance, where d(x,y) is

the degree of the distance:

Where n is the number of dimensions (attributes) and 𝑥𝑘 and 𝑦𝑘 are the kth attributes

(components) of data objects x and y, respectively [6].

3.1.2 Fuzzy C-means Algorithm

The Fuzzy C-means clustering algorithm is based on the minimization of an objective

function called C-means functional. It is defined by Dunn as:

(1.1)

Page 9: Seminar in Electronic Government - unifr.ch · PDF file3.1.2 Fuzzy C-means Algorithm ... The results of this seminar document should provide the parties ... over simulated annealing

5

Where:

is a vector of cluster prototypes (centers), which have to be determined, and

is a squared inner-product distance norm.

Statistically, (1.2) can be seen as a measure of the total variance of xk from vi. The

minimization of the c-means functional (1.2) represents a nonlinear optimization

problem that can be solved by using a variety of available methods, ranging from

grouped coordinate minimization, over simulated annealing to genetic algorithms. The

most popular method, however, is a simple Picard iteration through the first-order

conditions for stationary points of (1.2), known as the fuzzy c-means (FCM) algorithm.

The stationary points of the objective function (1.2) can be found by adjoining the

constraint (1.5) to J by means of Lagrange multipliers (1.6):

and by setting the gradients of (𝐽)̅ with respect to U, V and λ to zero. If D2ikA > 0, ∀i, k

and m > 1, then (U, V) ∈ Mfc × Rnxc may minimize (1.2) only if

And

(1.2)

(1.3)

(1.4)

(1.5)

(1.6)

(1.7)

(1.8)

Page 10: Seminar in Electronic Government - unifr.ch · PDF file3.1.2 Fuzzy C-means Algorithm ... The results of this seminar document should provide the parties ... over simulated annealing

6

Note that equation (1.8) gives vi as the weighted mean of the data items that belong to

a cluster, where the weights are the membership degrees. That is why the algorithm is

called "c-means". One can see that the FCM algorithm is a simple iteration through

(1.7) and (1.8).

The FCM algorithm computes with the standard Euclidean distance norm, which

induces hyper spherical clusters. Hence it can only detect clusters with the same shape

and orientation, because the common choice of norm inducing matrix is: A = I or it can

be chosen as an n x n diagonal matrix that accounts for different variances in the

directions in the directions of the coordinate axes of X:

or A can be defined as the inverse of the n x n covariance matrix: A = F-1, with

Here �̅� denotes the sample mean of the data. In this case, A induces the Mahalanobis

norm on Rn.

(1.9)

(1.10)

Page 11: Seminar in Electronic Government - unifr.ch · PDF file3.1.2 Fuzzy C-means Algorithm ... The results of this seminar document should provide the parties ... over simulated annealing

7

ALGORITHM STEPS:

Given the data set X, choose the number of clusters 1 < c < N, the weighting exponent

m > 1, the termination tolerance ε > 0 and the norm-inducing matrix A. Initialize the

partition matrix randomly, such that U(0) ϵ Mfc.

Repeat for l = 1,2,…

Step 1 Compute the cluster prototypes (means):

Step 2 Compute the Distances:

Step 3 Update the partition matrix:

(1.11)

(1.12)

(1.13)

Page 12: Seminar in Electronic Government - unifr.ch · PDF file3.1.2 Fuzzy C-means Algorithm ... The results of this seminar document should provide the parties ... over simulated annealing

8

4. Evaluation of VAA Algorithms

After a short overview of VAA basics in the previous sections, the focus of this chapter

lies in the implementation of two algorithms and the analysis of results of the used data

sets.

4.1 Datasets for the cases

The dataset of the voters’ answers to the 30 issue questions for Peruvian presidential

elections 2011 is provided by preferencematcher.org. The dataset for the candidates’

answers to the same issue questions was delivered by Dr. Fernando Mendez. The

Peruvian dataset for the candidates was filled in and scored by the candidates

themselves. In this subsection, we describe in details the content of the data sets.

4.1.1 Peru Presidential voters’ answers

The Peruvian Dataset had already been cleaned when delivered. The clean dataset

does not contain rogue data anymore. A rogue can be the case when a user answers

the issue statements in such a quick way (i.e. to test the application), that one must

assume, that he wasn’t reading them.

The clean dataset contains 40627 users who answered the following 30 issue

questions.

q1 The Peruvian state, rather than the public sector, should be the

owner of the most important businesses and industries of the

country.

q2 The market can resolve the problems in our society because it

distributes resources in a more efficient manner than the state.

q3 The government should limit, by law, interest rates charged by

banks.

q4 The government should control the prices of essential goods.

q5 It should be easier for companies to hire and fire employees.

q6 To keep unemployment rates low it would be acceptable to have a

higher rate of inflation.

q7 To balance the budget it is better to raise taxes than to cut

spending.

q8 The Peruvian government must honor the terms of the contracts on

which foreign companies have invested in Peru.

q9 It is more important to encourage economic growth than to protect

the environment

q10 It is better to finance road construction by private investment than

through taxes levied on all taxpayers

q11 Do you agree with a windfall tax on mining?

q12 After the reduction of IGV (general sales tax) from 19% to 18%, do

you think that IGV should be reduced still further?

q13 The government should spend more on public health services, even if

this may involve raising taxes.

q14 The government should spend more on public education, even if this

may involve raising taxes.

q15 Do you agree that teachers' salaries should be increased

unconditionally?

q16 Camisea gas should cover domestic consumption before being exported.

Page 13: Seminar in Electronic Government - unifr.ch · PDF file3.1.2 Fuzzy C-means Algorithm ... The results of this seminar document should provide the parties ... over simulated annealing

9

q17 The Free Trade Agreement with the United States should be

renegotiated.

q18 Peru should make more effort towards integration with neighboring

countries than in relations with the United States and Europe.

q19 Peru should introduce the death penalty for the rape of minors.

q20 The consumption of marijuana should be decriminalized in Peru.

q21 Homosexual couples should have the right to establish civil

partnerships.

q22 Abortion in the early months of pregnancy should be decriminalized.

q23 Should Compulsory Military Service be re-introduced?

q24 Do you agree that the budget for the defense sector should be

increased?

q25 A strict public security policy should be established, even if it

violates the human rights of offenders.

q26 The state child care program (Wawa Wasi) should be expanded.

q27 Do you agree that the salaries of senior public officials should be

increased?

q28 Should compulsory voting be maintained?

q29 Should the Congress once again have two chambers: Deputies and

senators?

q30 Parliamentary immunity should be abolished.

Figure 1: Peru Presidential 2011, Issue Questions

The questionnaire had the following answer categories:

-1 = “Completely Disagree”, -0.5 = “Disagree”, 2 = “Neither agree nor disagree”, 3 = “Agree”, 4 =

“Completely Disagree”, 99 = “No opinion”

4.1.2 Peru Presidential candidates’ answers

The dataset contains the answers of the top 5 candidates. The 30 issue questions are

the same as for the users.

The questionnaire had the following answer categories:

-1 = “Completely Disagree”, -0.5 = “Disagree”, 2 =” Neither agree nor disagree”, 3 = “Agree”, 4 =

“Completely Disagree”, 99 = “No opinion”

Page 14: Seminar in Electronic Government - unifr.ch · PDF file3.1.2 Fuzzy C-means Algorithm ... The results of this seminar document should provide the parties ... over simulated annealing

10

4.2 Datasets Pre-Processing

As we can observe in both datasets, the information per voter is very specific, vast and

the information can be represented as main or RAW vector. In addition, for the

purposes in our seminar study, we will define two types of vectors:

Voters’ vector: Vector that contains the most representative aspects of a voter.

Candidates’ vector: Vector that represents the characteristics of the political

candidate / party.

Both types of vectors can be illustrated in the following model [9]:

Where u (𝑖, 𝑘) and p (𝑗, 𝑘)(I,k) are the answers of the i-th user and j-th candidate

4.2.1 Voter’s Vector

The voter’s RAW vector in a VAA dataset normally contains a certain number of issue

questions (on average 30) and some additional questions (Superior Questions) which

contain demographic information, voting intention plus a self-placement for the voter’s

political orientation. The supplementary questions cannot be compared against a

candidate’s vector. Therefore, it is necessary to make a feature extraction in order to

create a voter’s vector that not only has less characteristics but also represents

properly the voter’s preferences (See Figure 2). The voter’s vector will be then the

vector of all covered issue questions.

RAW Vector from Dataset

Voter’s Vector

Size 30

Feature Extraction

Figure 2: Extracting 30 dimensions (questions) from the Raw Dataset

(1.14)

Page 15: Seminar in Electronic Government - unifr.ch · PDF file3.1.2 Fuzzy C-means Algorithm ... The results of this seminar document should provide the parties ... over simulated annealing

11

In addition to the 30 issue statements, we further consider Superior Question 5 of the

original dataset in our thesis. Superior Question 5 is the vote intention of the users for

the presidential candidates.

The coding for Superior Question 5 is given in the following:

1 = “Alejandro Toledo”, 2 = “Keiko Fujimori”, 3 = “Luis Castañeda Lossio”, 4 = “Pedro Pablo Kuczynski”,

5 = “Ollanta Humala”, 6 = “Other”, 7 = “None”, 98 = “Did not supply information”

Figure 3 shows the vote intention of the users in the original dataset in absolute values

and in percentages. The candidates will be introduced in section 4.2.2.

Vote intention

Frequency Percent Valid Percent

Cumulative

Percent

Valid Alejandro Toledo 5777 14,2 16,1 16,1

Keiko Fujimori 1535 3,8 4,3 20,4

Luis Castañeda Lossio 1884 4,6 5,3 25,7

Pedro Pablo Kuczynski 20397 50,2 57,0 82,7

Ollanta Humala 2450 6,0 6,8 89,5

Other 784 1,9 2,2 91,7

None 2963 7,3 8,3 100,0

Total 35790 88,1 100,0

Missing Did not supply information 4837 11,9

Total 40627 100,0

Figure 3: Vote Intention based on Superior Question 5

Page 16: Seminar in Electronic Government - unifr.ch · PDF file3.1.2 Fuzzy C-means Algorithm ... The results of this seminar document should provide the parties ... over simulated annealing

12

4.2.2 Candidates’ Vector

Based on the dataset structure analysis, this seminar thesis only contains the

candidates’ vectors for the top five candidates respectively the five strongest political

party alliances. Figure 4 gives an overview of the candidates and their political party

alliances. It is important to mention that party’s alliances make difficult to represent a

concrete ideology or political position, that is why we rather rely on a candidate’s

analysis that a party analysis. The 5 candidates’ vectors for the 30 issue questions can

be found in Appendix I.

Political Party or Group Presidential candidate

Alianza Gana Perú

Peruvian Nationalist Candidate (Partido Nacionalista Peruano)

Socialist Candidate (Partido Socialista)

Peruvian Communist Candidate (Partido Comunista Peruano)

Revolutionary Socialist Candidate (Partido Socialista Revolucionario)

Political Movement Socialist Voice (Movimiento Político Voz Socialista)

Ollanta Humala

Fuerza 2011

Force 2011 (Fuerza 2011)

National Renewal (Renovación Nacional)

Keiko Fujimori

Alianza Perú Posible

Possible Peru (Perú Posible)

We Are Peru (Somos Perú)

Popular Action (Acción Popular)

Alejandro Toledo

Alianza por el Gran Cambio

Alliance for Progress (Alianza para el Progreso)

Humanist Candidate (Partido Humanista)

Christian People's Candidate (Partido Popular Cristiano)

National Restoration (Restauración Nacional)

Pedro Pablo Kuczynski

Alianza Solidaridad Nacional

Change 90 (Cambio 90)

National Solidarity (Solidaridad Nacional)

Always Together (Siempre Unidos)

Union for Peru (Unión por el Perú)

Luís Castañeda Lossio

Figure 4: Peru Presidential candidates and their respective Party groups

Page 17: Seminar in Electronic Government - unifr.ch · PDF file3.1.2 Fuzzy C-means Algorithm ... The results of this seminar document should provide the parties ... over simulated annealing

13

4.2.3 Principal Component Analysis

One statistical procedure that helps us to perform later the algorithms evaluation is the

principal component analysis (PCA) involves a mathematical procedure that

transforms a number of (possibly) correlated variables into a (smaller) number of

uncorrelated variables called principal components. The principal component accounts

for as much of the variability in the data as possible, and each succeeding component

accounts for as much of the remaining variability as possible. The main objectives of

PCA are:

1. Identify new meaningful underlying variables.

2. Discover or to reduce the dimensionality of the data set.

The mathematical background lies in "Eigen analysis": The eigenvector associated

with the largest eigenvalue has the same direction as the first principal component.

The eigenvector associated with the second largest eigenvalue determines the

direction of the second principal component.

In this seminar paper, we used the second objective, in that case the covariance matrix

of the data set (also called the "data dispersion matrix") is defined as follows:

Where , the mean of the data (N equals the number of objects in the data set).

Principal Component Analysis (PCA) is based on the projection of correlated high-

dimensional data onto a hyperplane. This mapping uses only the first few q nonzero

eigenvalues and the corresponding eigenvectors of the ,covariance

matrix, decomposed to the matrix that includes the eigenvalues of in its

diagonal in decreasing order, and to the matrix that includes the eigenvectors

corresponding to the eigenvalues in its columns. The vector

is a q-dimensional reduced representation of the observed vector xk, where the Wi

weight matrix contains the q principal orthonormal axes in its column .

(1.15)

Page 18: Seminar in Electronic Government - unifr.ch · PDF file3.1.2 Fuzzy C-means Algorithm ... The results of this seminar document should provide the parties ... over simulated annealing

14

4.2.4 Cleaning of the original Dataset

In order to simplify our analysis and to illustrate the population of users, the original

dataset [30 issue-questions, 40627 users] had to be reduced. The reduction was done

in 4 steps. Figure 5 illustrates Steps 1 to Step 4 of the cleaning process.

RAW DATA MATRIX

Siz

e 3

0

40627

RAW DATA MATRIX

Siz

e 2

8

2500

RAW DATA MATRIX

Siz

e 3

0

32717

RAW DATA MATRIX

Siz

e 3

0

26149

1 2

RAW DATA MATRIX

Siz

e 3

0

2500

3

4

Figure 5: Cleaning of the Dataset

Step 1: It is important to mention that we only consider users which fully

answered all the 30 issue statements, i.e. any user that had at least one “99”

value was exclude from the analysis. The reason for that is that the Research

model in this seminar thesis is designed to only use fully answered

questionnaires. By cleaning the dataset for all “99” values, the dataset was

reduced from 40627 users (original size) to a size of 32727 users.

Step 2: The aim of this step was to downsize the amount of 32727 users to a

new amount of users that gave a clear statement about their vote intention.

Users which answered “Other”, “None” or “Did not supply information” were

excluded from the dataset. The cleaning done in of step 2 resulted in 26149

users.

Step 3: In this step, the new amount of 26149 users had to be downsized to a

smaller, reasonable quantity of users that represent the whole population. We

decided to reduce the dataset to 2500 users of which each of the five

presidential candidates has the same amount of 500 voters. We had to consider

an equal distribution of number of users per candidate as otherwise fairness is

Page 19: Seminar in Electronic Government - unifr.ch · PDF file3.1.2 Fuzzy C-means Algorithm ... The results of this seminar document should provide the parties ... over simulated annealing

15

not given and the later Fuzzy C-mean Algorithm would give preference to a

weight according to the amount of users per candidate.

Step 4: The last step was dealing with the reduction of the dimensionality of the

dataset. Originally, the candidates’ vector for the five presidential candidates

contained 30 issue questions. To get a clearer image of the political landscape,

only fully answered issue questions by the candidates can be taken into account

for later analysis. As 2 of the 30 issue statements (question 18 and questions

30) were not fully answered by the candidates, we had to reduce the candidates’

vector from 30 to 28 dimensions. This simple procedure of omitting 2 questions

is called “Feature Extraction” and is not to be confused with Principal

Component Analysis, where the reduced dimensions have a different value

range from the original vector.

The final RAW Data Matrix has the dimensions: [28 questions, 2500 users].

4.2.5 Vote Intention

This section shows the Vote Intention according to Superior Question 5 of the reduced

dataset. Figure 6 shows the Plot of the Vote Intention in 2 dimensions and Figure 7 in

the 3 dimensional space. Both times, the dimensions were reduced by Principal

Component Analysis to 2 respectively 3 dimensions.

Figure 6: PCA Dimensionality Reduction in 2D

Page 20: Seminar in Electronic Government - unifr.ch · PDF file3.1.2 Fuzzy C-means Algorithm ... The results of this seminar document should provide the parties ... over simulated annealing

16

Figure 7: PCA Dimensionality Reduction in 3D

As there is observed in Figures 6 and 7, the voter’s intention is mixed and it is not

clear to see in 2D or 3D a clear ideology or political position for voters. Fortunately,

during the evaluation, the VAA algorithms will allow to understand better the voter’s

answers and their relations.

4.3 Fuzzy C-means Algorithm

Our analysis contains three different FCM approaches. While the first version

(Standard FCM algorithm) does not allow to enter initial centers, versions 2 and 3 have

the ability to customize initial centers as part of the algorithm. They are modified

versions of the FCM algorithm. For reasons of simplicity, we use the abbreviation FCM

from this point on.

4.3.1 FCM (Random Approach)

As mentioned above, the first version of the FCM algorithm does not use initial centers.

The final centers of the candidates are then calculated randomly.

4.3.2 FCM (Candidates Approach)

The FCM Candidates Approach (see Figure 8) uses the candidates’ vector as an initial

value for cluster centers. The algorithm computes the final candidates’ centers based

Page 21: Seminar in Electronic Government - unifr.ch · PDF file3.1.2 Fuzzy C-means Algorithm ... The results of this seminar document should provide the parties ... over simulated annealing

17

on the input of initial candidates’ centers. The advantage of this approach is that the

candidates’ positions can be directly integrated into the algorithm. However, we believe

that the candidates’ vector is not a reliable measure to represent the position of the

party as there is the possibility that the candidate can manipulate the answers in his

favor.

Toledo’s Vector

Fujimori’s Vector

Castañeda’s Vector

Humala’s Vector

Size 28

Kuczynski’s Vector

Fuzzy C Means Alg.Initial C

an

did

ate Cen

ters

Inputs Final Centers

RAW DATA MATRIX

Siz

e 2

8

2500

Input

Figure 8: Process to compute the Final Center based on candidates

4.3.3 FCM (Mean Voter Approach)

The FCM Mean Voter Approach (see Figure 9) is only based on the voters’ dataset.

As this approach does not consider candidates’ vectors, one must create a vector that

represents the position of each candidate. This simple method calculates a Mean Voter

for each of the 5 candidates. The following definition shows the average voter of a

Candidate [9].

Where pj is the average voter of a political party or candidate, Nj are the total number

of voters of political party or candidate j, and u(i,k) the answers of the i-th user.

(1.16)

Page 22: Seminar in Electronic Government - unifr.ch · PDF file3.1.2 Fuzzy C-means Algorithm ... The results of this seminar document should provide the parties ... over simulated annealing

18

Once the 5 Mean Voters’ vectors are created, they can be integrated into the algorithm

as Initial Users’ Centers. The algorithm then runs and delivers the Final Users’ Centers.

The disadvantage of this method is, that the vectors of the users have to be constructed

first from the dataset before to run the algorithm. However, the method has the big

advantage that it takes the information from the whole population of users. As the user

is believed to take the survey seriously and being honest when filling in the

questionnaire, the result of Final Users’ Centers will show a more accurate image of

the candidates’ positions.

Toledo User’s Vector

Fujimori Users’s Vector

Castañeda Users’s Vector

Humala Users’s Vector

Size 28

Kuczynski Users’s Vector

Fuzzy C Means Alg.

Initial C

andid

ate Cente

rs Base

d o

n U

sers

Input Final Centers

RAW DATA MATRIX

Siz

e 2

8

2500

Input

Figure 9: Process to compute the Final Centers of the candidates based on users

Page 23: Seminar in Electronic Government - unifr.ch · PDF file3.1.2 Fuzzy C-means Algorithm ... The results of this seminar document should provide the parties ... over simulated annealing

19

4.3.4 The FCM Process

Figure 10 shows the Process for the two modified versions of the FCM algorithm from

the Raw Data Matrix until the Results (Final Fuzzy Cluster Centers) of the FCM

Algorithms.

Dimensionality

Reduction

RAW DATA MATRIX

Siz

e 2

8

2505

Fuzzy C Means Alg.PCA

RAW DATA MATRIX

Siz

e 2

2505

FINAL FUZZY

CLUSTER

CENTERS

Figure 10: The FCM Process

The base for the process is the Raw Data Matrix [28x2500] to which in the case of the

FCM Candidates Approach, the vectors of the 5 Candidates’ Initial Centers have been

added in the last five columns. In the case of the FCM Mean User Approach, the 5

Mean Users’ Initial Centers have been added at the same position of the Raw Data

Matrix for both approaches, the initial matrix now has the new dimensions [28x2505].

As a next step, the Principal Component Analysis over both matrices in order to be

able to visualize the users in a two-dimensional space. The new matrices now are

reduced from 28 to 2 dimensions. The next step run the FCM algorithm over the 2

matrices and supplied the Final Candidates’ Centers respectively the Final Users’

Centers. The results for the both Approaches will be shown and discussed in sections

4.4.3 and 4.4.4

Page 24: Seminar in Electronic Government - unifr.ch · PDF file3.1.2 Fuzzy C-means Algorithm ... The results of this seminar document should provide the parties ... over simulated annealing

20

4.4 VAA Algorithms Results and Analysis

In the next section, we present the results of the Euclidean Distance and the 3 versions

of the FCM algorithm with respect to their outcomes and analysis.

4.4.1 Euclidean Distance

The results of the Euclidean distance are based on the original dataset with 40627

users and their respective answers to the 30 issue questions. Also the original

candidates’ vectors were used for the computation. Figure 11 contains the process of

the Eulidean Distance.

Voter’s Vector

Voter’s Vector

Voter’s Vector

Voter’s Vector

Size 30

All

Vote

rs

Euclidean Alg.

Calculate the distance between

each voter’s vector and the

candidate’s vector.

To

led

o’s

Vecto

r

Fu

jimo

ri’s

Ve

cto

r

Ca

stañ

ed

a’s

Vecto

r

Hum

ala

’s V

ect

or

Ku

czynsk

i’s

Vecto

r

Siz

e 3

0

Choose Min

Distance and

Classify

Voters Voters

Voters Voters

Voters

Classify Voters

Figure 11: The Process of the Euclidean Distance

Page 25: Seminar in Electronic Government - unifr.ch · PDF file3.1.2 Fuzzy C-means Algorithm ... The results of this seminar document should provide the parties ... over simulated annealing

21

The results of the Euclidean Distance are shown in Figure 12.

Figure 12: Euclidean Distance Results

Figure 13: Bar Chart of the Vote Intention

Page 26: Seminar in Electronic Government - unifr.ch · PDF file3.1.2 Fuzzy C-means Algorithm ... The results of this seminar document should provide the parties ... over simulated annealing

22

A simple observation based on Figure 12 is that according to their way of thinking,

most users are similar to Humala, followed by Castañeda and Kuczynski. When we

look at the results of the Peruvian Presidential Elections in 2011, where Humala was

the winner, the result of the Euclidean Distance for Humala is consistent with his

electoral victory. At this point, an interesting idea is to compare the results of the

Euclidean Distance with the Vote Intention of the users (Superior Question 5). Figure

13 shows the bar chart of the Vote Intention. An eye-catching discovery is that

according to the Vote intention, most users favor Kuczynski, followed by Toledo and

Humala. The comparison of the two figures shows that the own perceptions of the

users vary greatly from their real way of thinking. These differences show that even

after filling out the questionnaire, the voters do not have found a good candidate to

vote for or they vote for a candidate’s popularity even if his statements do not match

with the voter’s statements. This fact leads to the conclusion that the results of the

Euclidean distance are not precise enough to provide the voter with a reliable

recommendation. For this reason, the next session (4.4.2) goes deeper into the

analysis of the Fuzzy C-means algorithm.

4.4.2 FCM (Random Approach)

Figure 14 and Figure 15 show the results of the Random FCM Clustering Algorithm

respectively the voters classified according to the Random Final Centers. The main

disadvantage of this method is that we do not have knowledge about the initial points.

This leads to the fact that every time the algorithm is running again, the Final Centers

will end up in a different position. As a consequence, the clustering of the users will

change as well. Therefore, the random approach is very low in terms of prediction

accuracy and should not be considered for the construction of a later Fuzzy

Recommender System.

Page 27: Seminar in Electronic Government - unifr.ch · PDF file3.1.2 Fuzzy C-means Algorithm ... The results of this seminar document should provide the parties ... over simulated annealing

23

Figure 14: Random Final Centers

Figure 15: Voters classified by Random Final Centers

Page 28: Seminar in Electronic Government - unifr.ch · PDF file3.1.2 Fuzzy C-means Algorithm ... The results of this seminar document should provide the parties ... over simulated annealing

24

4.4.3 FCM (Candidates Approach)

Figure 16 and Figure 17 shows the results of the Random FCM Clustering Algorithm

with the Initial Candidates’ Centers and Final Candidates Centers.

Figure 16: Final Candidates’ Centers

Figure 17: Voters classified by Candidates' Final Centers

Page 29: Seminar in Electronic Government - unifr.ch · PDF file3.1.2 Fuzzy C-means Algorithm ... The results of this seminar document should provide the parties ... over simulated annealing

25

In Figure 16, one can observe that the Initial Candidate’s Centers are located to the

very left of the graph and do not find themselves inside the scatterplot of the 2500

users. Furthermore, they are located far away from the Candidates Final centers

produced by the algorithm. This observation brings us back to section 4.3.2 to the

disadvantage of the Candidates’ Approach. The location of the Initial Centers shows

strong evidence for our apprehension that the candidate’s vector is not a reliable size

to represent the real position of the party as the candidate can manipulate the answers

in this favor for various reasons. Figure 17 contains the assignment of the users to

their particular candidate. The clustering performs very well as one can notice that

each Final Center lies pretty much in the center of the particular cluster of users.

4.4.4 FCM (Mean User Approach)

Figure 18 and Figure 19 show the results of the FCM Clustering Algorithm respectively

the voters classified by the Mean Users’ Final Centers.

Figure 18: Final Users' Centers

Page 30: Seminar in Electronic Government - unifr.ch · PDF file3.1.2 Fuzzy C-means Algorithm ... The results of this seminar document should provide the parties ... over simulated annealing

26

Figure 19: Voters classified by Mean User Final Centers

In Figure 18, it is clearly visible that the Initial Users’ Centers are all located close to

the Final Users’ Centers. The close proximity of them underlies our guess from section

4.3.3 about the high prediction accuracy of the FCM Mean User Approach. Therefore,

the User’s vector implies the best representation of a candidate and his particular party

groups. Similar to the case of the Candidates’ Approach, the clustering of the users in

Figure 19 performs well and the Final Users Centers are situated well in the respective

cluster.

As a summary, the Mean User Approach is the only one out of the 3 considered FCM

algorithms that has a high ability to prediction. To its performance comes the big

advantage that it does not require to have a candidate’s vector. It can be applied to

any dataset of users for any kind of election. However, in order to be able to perform

the algorithm, the user dataset must not only contain the answers to the issue

questions, but also a Superior Question which can be the vote intention, or the

preference for a candidate or a political party. Elsewise, the Mean User vector cannot

be computed.

Page 31: Seminar in Electronic Government - unifr.ch · PDF file3.1.2 Fuzzy C-means Algorithm ... The results of this seminar document should provide the parties ... over simulated annealing

27

4.4.5 Final Candidates Distances Matrix

After the application of the 3 versions of the FCM algorithm, this section presents the

Candidates Distances Matrix according to their Final Centers.

Figure 20: Final Candidates Distances Matrix

The red values in the Matrix indicate that two candidates are very close from each

other according to the results of the performed FCM algorithms. For instance, the

similarity score between Toledo computed by candidates and Castañeda computed by

users is 1.04E-0.4, located in the blue framed box in Figure 20. Looking at Toledo, we

discover that Toledo based on the answers of his questionnaire, is perceived as

Castañeda by the population. This fact shows that the perception of the people is that

Toledo behaves in a similar way similar as Castañeda in his way to answer. The same

kind of observations can be made for (Castañeda Candidates vs. Toledo Users) with

score 3,11E-0.5, (Fujimori Candidates vs. Kuczynski Users) with score 1.60E-0.5,

(Kuczynski Candidates vs. Humala Users) with score 8.93E-0.5 and (Humala

Candidates vs. and Fujimori Users with score 7.30E-0.5.

4.5 Degrees of Membership

4.5.1 Candidates Approach

After the FCM algorithm has been performed, one major goal of this seminar thesis is

to illustrate the degrees of membership to the five candidates for every user considered

in the dataset. The presentation of the degrees of membership will be the fundamental

result of any future Fuzzy based Recommender System for the user who wants to

know to which degree he belongs to the way of thinking of particular candidate /party.

For our seminar case, we decided to pick User 3. His degrees of Membership by

Candidates are shown below in Figure 21.

Toledo Fujimori Castañeda Kuczynski Humala Toledo Fujimori Castañeda Kuczynski Humala Toledo Fujimori Castañeda Kuczynski Humala

Toledo 0 0,0156 0,0412 0,0201 0,0281 0,0281 0,0156 0,0201 4,20E-08 0,0412 0,02 0,0413 0,0282 0,0156 8,93E-05

Fujimori 0,0156 0 0,0258 0,0356 0,0129 0,0129 3,72E-08 0,0356 0,0156 0,0258 0,0355 0,0258 0,0129 1,60E-05 0,0155

Castañeda 0,0412 0,0258 0 0,0613 0,0133 0,0133 0,0258 0,0613 0,0412 1,05E-08 0,0613 7,30E-05 0,0132 0,0257 0,0412

Kuczynski 0,0201 0,0356 0,0613 0 0,0482 0,0482 0,0356 1,89E-08 0,0201 0,0613 3,10E-05 0,0614 0,0482 0,0356 0,0201

Humala 0,0281 0,0129 0,0133 0,0482 0 2,33E-08 0,0129 0,0482 0,0281 0,0133 0,0482 0,0134 1,03E-04 0,0129 0,028

Toledo 0,0281 0,0129 0,0133 0,0482 2,33E-08 0 0,0129 0,0482 0,0281 0,0133 0,0482 0,0134 1,03E-04 0,0129 0,028

Fujimori 0,0156 3,72E-08 0,0258 0,0356 0,0129 0,0129 0 0,0356 0,0156 0,0258 0,0355 0,0258 0,0129 1,60E-05 0,0155

Castañeda 0,0201 0,0356 0,0613 1,89E-08 0,0482 0,0482 0,0356 0 0,0201 0,0613 3,11E-05 0,0614 0,0482 0,0356 0,0201

Kuczynski 4,20E-08 0,0156 0,0412 0,0201 0,0281 0,0281 0,0156 0,0201 0 0,0412 0,02 0,0413 0,0282 0,0156 8,93E-05

Humala 0,0412 0,0258 1,05E-08 0,0613 0,0133 0,0133 0,0258 0,0613 0,0412 0 0,0613 7,30E-05 0,0132 0,0257 0,0412

Toledo 0,02 0,0355 0,0613 3,10E-05 0,0482 0,0482 0,0355 3,11E-05 0,02 0,0613 0 0,0613 0,0482 0,0355 0,0201

Fujimori 0,0413 0,0258 7,30E-05 0,0614 0,0134 0,0134 0,0258 0,0614 0,0413 7,30E-05 0,0613 0 0,0133 0,0258 0,0413

Castañeda 0,0282 0,0129 0,0132 0,0482 1,03E-04 1,03E-04 0,0129 0,0482 0,0282 0,0132 0,0482 0,0133 0 0,0129 0,0281

Kuczynski 0,0156 1,60E-05 0,0257 0,0356 0,0129 0,0129 1,60E-05 0,0356 0,0156 0,0257 0,0355 0,0258 0,0129 0 0,0155

Humala 8,93E-05 0,0155 0,0412 0,0201 0,028 0,028 0,0155 0,0201 8,94E-05 0,0412 0,0201 0,0413 0,0281 0,0155 0

Candidates UsersFinal Centers

Ran

dom

Can

did

ate

sU

sers

Random

Page 32: Seminar in Electronic Government - unifr.ch · PDF file3.1.2 Fuzzy C-means Algorithm ... The results of this seminar document should provide the parties ... over simulated annealing

28

Figure 21: Degrees of Membership by Candidates (User 3)

4.5.2 Mean User Approach

Similar to the prior approach, Figure 22 displays the degrees of membership by Mean

Users again for User 3.

Figure 22: Degrees of Membership by Mean Users (User 3)

Page 33: Seminar in Electronic Government - unifr.ch · PDF file3.1.2 Fuzzy C-means Algorithm ... The results of this seminar document should provide the parties ... over simulated annealing

29

5. Recommendations

The goal of this section is to provide recommendations for VAA Designers who aim at

integrating a Fuzzy based logic in their applications. All recommendations are based

on observations which have been made during the work on this seminar thesis.

The key factor for a successful data analysis is a clean dataset which has been

swept for “rogue data”. Only then, one can conduct a serious analysis. If the

dataset has not yet been cleaned for rouges, consider a Weeding Out Rogues

Technique.

In our clean dataset, we encountered a huge amount of users that had a 99 =

”No opinion” for a particular answer to a question in their vector. We highly

recommend to rather eliminate the questions than keep them for the purpose of

not having a distorted image of the population.

To increase the accuracy of the Recommendation, voters and candidates

should be motivated to provide full answers to the issue questions.

Fairness (Equal distribution) of the users with their particular vote intention was

an important perquisite to run the Fuzzy C-means algorithm properly. Select the

same number of people. If users are taken in the same proportional amount per

candidate which leads to a preference weight by the algorithm), fairness is not

guaranteed anymore.

For the purpose of visualization of the data, consider to work in a low

dimensional space with two- or three-dimensions. The Principal Component

Analysis can be easily implemented and is a useful tool for dimensionality

reduction.

We do not recommend to use the FCM Random Approach because the Initial

Centers cannot be entered and the Final Centers will be in a different location,

every time the algorithm is performing again.

When choosing the Fuzzy C-means algorithm, the Mean User Approach is

highly recommended to be taken into consideration as it serves the most

accurate results. This approach shows the reality what voters want and need.

The Mean User Approach also has benefits regarding the candidates’

perspective as they can better target and localize the people which are similar

to themselves. If the Candidates Initial Centers are available, the Candidates

Approach can be considered for the purpose of comparison when creating the

Page 34: Seminar in Electronic Government - unifr.ch · PDF file3.1.2 Fuzzy C-means Algorithm ... The results of this seminar document should provide the parties ... over simulated annealing

30

clusters. For instance, there is the possibility that the both approaches can

deliver more similar results in another dataset.

Once, the different FCM algorithms have been run, researches can profit from

the Final Centers Distance Matrix in order to identify differences in the

perceptions of the candidate by the users and himself. Those differences were

not further examined but can be an interesting point to continue discussion.

6. Conclusion

The intensive work on the case study of Peruvian Presidential elections 2011 gave us

a comprehensive insight into the field of Voting Advice Applications. After having

collected the theoretical framework in prior case studies and literature, the pre-

processing of the original dataset turned out to be a first major challenge. As there is

no general procedure how to clean a dataset, our task was to try different methods and

to evaluate which way of reducing the dataset would be most appropriate. We finally

created a dataset which in our opinion was ideal to apply the different FCM algorithms.

In our practical part, we first applied the Euclidean Distance which is the one that was

used in most of prior research case studies. However, until now, only few authors have

ever applied a Fuzzy based Approach to Voting Advice Applications. This existing gap

was for us the ideal point to bind on and try something new – which leaded us to

introduce the Fuzzy C-Mean Algorithm and apply to the case of Peruvian Presidential

Elections.

The Introduction of the FCM algorithm was the base for the analysis and discussion of

the results of this seminar thesis. The results had a wide range starting from applying

the Euclidean Distance and comparing it with the Vote intention, over the comparison

and identification of differences in terms of prediction accuracy of the 3 tested FCM

algorithms, over the observations made in the Final Candidates Distance Matrix, until

the visualization of Degrees of Membership for a particular user.

We see our contribution in the application of the Fuzzy C-Means algorithm with its

corresponding visualization of the results and the proposition of a Fuzzy

Recommender System in Chapter 7. This seminar thesis provides an ideal base for

future research in terms of Fuzzy Recommender Systems and may motivate to conduct

further inquiries in the field of Voting Advice Applications.

Page 35: Seminar in Electronic Government - unifr.ch · PDF file3.1.2 Fuzzy C-means Algorithm ... The results of this seminar document should provide the parties ... over simulated annealing

31

7. Future Work

Based on previous results and recommendations, we consider that the algorithm has

shown good results and it has potential to be tested in another similar case, in the next

potential elections in Peru or other country, where either expert or candidates’ vectors

are available. Finally, we propose as a future work to go further and not only test what

we have done in this seminar but also the creation of a recommender system that

considers our Fuzzy C-Means Algorithm. Thereby in this section we mainly focus to

describe the proposed Recommender Architecture that can be applied in future

analysis and cases. The proposed architecture is shown in Figure 23.

QUESTIONS

Q1

Q2

Q3

User

Q4

Q28

Correct

Dimensions

Verification

Verify the number of

answers in order to

select dimensionality

and create a user

vector.

RAW DATA MATRIX

Rezi

se i

f

need

ed

2500

To

led

o’s

Vecto

r

Fu

jimo

ri’s

Ve

cto

r

Ca

stañ

ed

a’s

Vecto

r

Hum

ala

’s V

ect

or

Ku

czynsk

i’s

Vecto

r

Rezi

se i

f n

eed

ed

Merge Matrices

RAW DATA MATRIX

Fin

al Siz

e

2505

PCA

Inp

ut

Fuzzy C Means Alg.FINAL FUZZY

CLUSTER

CENTERS

RAW DATA MATRIX

Siz

e 2

2505

Use

r V

ecto

r

Fuzzy Rules

Final

Recommendation

(a)

(b)(c)

(d)

(e)(f)(g)

(h)

(i) (j)

Figure 23: Proposed Fuzzy Recommender Algorithm for VAAs

Based on Figure 23, we can observe different steps that are explained in the next

points:

Page 36: Seminar in Electronic Government - unifr.ch · PDF file3.1.2 Fuzzy C-means Algorithm ... The results of this seminar document should provide the parties ... over simulated annealing

32

a) The user who would like to receive a recommendation needs to answer a set of

28 questions or maybe more depending of the dataset obtained. Then

depending of the questions answered a verification process determines which

questions were answered and form a User’s Vector that will be added to the

general dataset.

b) Once the size of the User’s Vector is determined, the candidate and RAW Data

Matrix has to be in the same dimensions of the User’s Vector, therefore as we

mentioned before in the recommendations, the more than the user answers, the

better the recommendation provided.

c) Once the Raw Data and the candidate’s vectors were adapted to the User’s

Vector, both matrices are merged.

d) The new formed matrix is then processed in a dimensionality reduction to N

dimensions to 2 by the PCA algorithm.

e) The output of the PCA is in two dimensions and contains the candidate’s vectors

and the User’s Vector in the Raw Data Matrix.

f) The new 2 dimensional matrix is an input for the Fuzzy C-Means Algorithm.

g) The result is the centers of the candidates that the user has a choice to vote.

h) In this part we determine the level of membership assigned by the algorithm for

the user to each candidate or political party.

i) Once we know the degree of membership from the user, we proceed to apply

some fuzzy rules which can be customized by each context depending of the

country or additional preferences of the user like, if the user belongs 50% to two

candidates, then it will choose the most popular for instance. This fuzzy rules

part is optional but it can be helpful when the membership degree of users who

want a recommendation are very similar.

j) The last part is to provide a final recommendation and in case that the user has

not answered all the questions. The user then is invited to get a new

recommendation, based on more questions.

The key element of the success or high accuracy of this recommender is to find a good

dataset where all the users have answered all the answers and the amount of users

per candidate has the same amount, otherwise one candidate can influence the

distribution of the cluster centers. Finally, for the side of the user, as more information

the user provides, the better the recommendation will be provided.

Page 37: Seminar in Electronic Government - unifr.ch · PDF file3.1.2 Fuzzy C-means Algorithm ... The results of this seminar document should provide the parties ... over simulated annealing

33

8. References

[1] Teran, Luís. Smart Participation: A Fuzzy-Based Recommender System for Political

Community-Building. Ph.D. Thesis of the University of Fribourg, Department of

Informatics, 2014.

[2] Mendez, F. (2014) What’s behind a matching algorithm: A critical assessment of

how Voting Advice Applications produce voting recommendations, in: Marschall, S.,

Garzia, D. (Eds.), Matching Voters with Parties and Candidates (pp. 49-66).

Colchester: ECPR Press.

[3] Rabinowitz, G. and Macdonald, S. E. (1989). A directional theory of issue voting.

The American Political Science Review, 83(1):93{121.

[4] Garzia Diego; Marshall Stefan (eds), Matching voters with parties and candidates:

voting advice applications in a comparative perspective, Colchester: ECPR Press,

2014, pp. 105-114

[5] Garzia, Diego; Marschall Stefan (eds), Matching voters with parties and candidates:

voting advice applications in a comparative perspective, Colchester: ECPR Press,

2014, pp. 1-10

[6] Garzia, Diego; Marschall Stefan. Matching voters with parties and candidates:

Voting Advice Applications in a Comparative Perspective, Colchester; ECPR Press,

2014.

[7] Mendez, F., Gemenis, K., Djouvas, C. (2014) Methodological challenges in the

analysis of Voting Advice Application generated data. Proceedings of the 9th

International Workshop on Semantic and Social Media Adaptation and

Personalization, pp. 142-148

[8] Mendez, Fernando. Matching voters with political parties and candidates: an

empirical test of four algorithms. Int. J. of Electronic Governance, 2012 Vol.5, No.3/4,

pp.264 – 278.

[9] Katakis, I., N. Tsapatsoulis, V. Triga, C. Tziouvas, F. Mendez (2012). Clustering

Online Poll Data: Towards a Voting Assistance System. Proceedings of the 7th

International Workshop on Semantic and Social Media Adaptation and Personalization

(SMAP’12), Luxembourg

Page 38: Seminar in Electronic Government - unifr.ch · PDF file3.1.2 Fuzzy C-means Algorithm ... The results of this seminar document should provide the parties ... over simulated annealing

34

Appendix I: The 5 Candidates’ Vectors for the 30 Issue questions

-1 = “Completely Disagree”, -0.5 = “Disagree”, 2 =” Neither agree nor disagree”, 3 = “Agree”, 4 =

“Completely Disagree”, 99 = “No opinion”

Questions Toledo Fujimori Castañeda Kuczynski Humala

q1 1,0 1,0 1,0 1,0 -0,5

q2 0,5 -0,5 0,5 0,0 0,5

q3 0,5 0,5 0,5 1,0 -0,5

q4 0,5 0,5 0,5 1,0 -1

q5 0,5 -0,5 0,5 -1 0,5

q6 0,5 0,5 0,5 1,0 0,5

q7 0,5 0,5 0,0 1,0 -0,5

q8 -0,5 -0,5 -0,5 -1 0,5

q9 1,0 0,5 0,5 1,0 1,0

q10 -1 -0,5 0,0 -1 0,5

q11 -0,5 0,5 0,5 1,0 -1

q12 0,5 -1 0,5 -1 -0,5

q13 0,5 0,0 0,0 0,5 -0,5

q14 0,5 0,0 0,0 0,5 -0,5

q15 1,0 0,5 1,0 1,0 -0,5

q16 -1 -0,5 -1 -1 -1

q18 0 99,0 0.5 1 -0,5

q17 0,5 0,5 0,5 0,5 -1

q19 0,5 -1 0,5 1,0 1,0

q20 0,5 1,0 1,0 1,0 1,0

q21 -0,5 0,5 -0,5 -0,5 -0,5

q22 0,0 1,0 1,0 1,0 0,5

q23 0,5 -0,5 0,5 1,0 -0,5

q24 -0,5 -1 -0,5 0,0 -0,5

q25 0,5 0,0 0,5 1,0 0,5

q26 -1 -0,5 -1 -1 -0,5

q27 -0,5 -0,5 -1 -0,5 -0,5

q28 -0,5 0,5 0,0 1,0 -1

q29 -0,5 0,5 -0,5 1,0 -0,5

q30 0,0 99,0 0.5 -0,5 -0,5

Page 39: Seminar in Electronic Government - unifr.ch · PDF file3.1.2 Fuzzy C-means Algorithm ... The results of this seminar document should provide the parties ... over simulated annealing

35

Appendix II: Matlab Code for the Project

The next code is a new modified version for the Fuzzy Cluster Library in Matlab and

we created a new function based on the original one with some modifications and

suggestions of Dr. Luis Terán Tamayo who provided us a more complex modification

of this library. His support and help was very useful to understand the changes in the

code in order to create our version and allow the user to introduce the initial centers

in the function directly.

MATLAB CODE:

function result = CFCMclust(data,param,centers,mode)

%Authors: Jose A. Mancera, Philipp Bosshard

%University of Fribourg

%email: [email protected]

%email: [email protected]

path(path,'C:\Program Files\MATLAB\R2015b\toolbox\FUZZCLUST')

%% data normalization

data_bi_dim = data.X;

numb_centers=param.c;

%checking the parameters given

%default parameters

if exist('param.m')==1, m = param.m;else m = 2;end;

if exist('param.e')==1, e = param.m;else e = 1e-4;end;

[users,dimensions] = size(data_bi_dim);

unit_mat_temp = ones(users,1);

%% Initialize fuzzy partition matrix

if strcmp(mode, 'random')==1

mm = mean(data_bi_dim); %mean of the data

aa = max(abs(data_bi_dim - ones(users,1)*mm));

centers_clusters = 2*(ones(numb_centers,1)*aa).*(rand(numb_centers,dimensions)-0.5)

+ ones(numb_centers,1)*mm;

elseif strcmp(mode, 'candidate')==1

centers_clusters=centers;

elseif strcmp(mode, 'mean_voters')==1

centers_clusters=centers;

end

fprintf('The value of the initial centers given by the CFCM (x,y) are: \n')

centers_clusters

fprintf('The number of centers are : \n')

numb_centers

Page 40: Seminar in Electronic Government - unifr.ch · PDF file3.1.2 Fuzzy C-means Algorithm ... The results of this seminar document should provide the parties ... over simulated annealing

36

for j = 1 : numb_centers,

dist_cent_user = data_bi_dim - unit_mat_temp*centers_clusters(j,:);

d(:,j) = sum((dist_cent_user*eye(dimensions).*dist_cent_user),2);

end;

d = (d+1e-10).^(-1/(m-1));

matrix_centers = (d ./ (sum(d,2)*ones(1,numb_centers)));

f = zeros(users,numb_centers); % partition matrix

iter = 0; % iteration counter

%% Iterate

while max(max(matrix_centers-f)) > e

iter = iter + 1;

f = matrix_centers;

% Calculate centers

fm = f.^m;

sumf = sum(fm);

centers_clusters = (fm'*data_bi_dim)./(sumf'*ones(1,dimensions));

for j = 1 : numb_centers,

dist_cent_user = data_bi_dim - unit_mat_temp*centers_clusters(j,:);

d(:,j) = sum((dist_cent_user*eye(dimensions).*dist_cent_user),2);

end;

distout=sqrt(d);

J(iter) = sum(sum(matrix_centers.*d));

% Update matrix_centers

d = (d+1e-10).^(-1/(m-1));

matrix_centers = (d ./ (sum(d,2)*ones(1,numb_centers)));

end

fm = f.^m;

sumf = sum(fm);

%results

result.data.f = matrix_centers;

result.cluster.v = centers_clusters;

result.iter = iter;