Sustainable Questions


Transcript of Sustainable Questions

Page 1: Sustainable Questions

Sustainable Questions

27 August 2012

Bart de Goede

Universiteit van Amsterdam

Determining the expiration date of answers

Supervisors: Maarten de Rijke, Anne Schuth

Page 2: Sustainable Questions

Outline

• Introduction to CQA

• Problem statement

• Approach

• Cluster similar questions

• Compare answers in clusters

• Classify sustainable clusters

• Discussion and conclusion

Page 3: Sustainable Questions

Community Question Answering

• Community of users asking and answering questions

• Natural language

• Formally, a service that involves:

1) A method for a person to present his/her information need in natural language,

2) a place where other people can respond to that information need and

3) a community built around such a service based on participation. (Shah et al., 2009)

Page 4: Sustainable Questions

Community Question Answering

Page 5: Sustainable Questions

Community Question Answering

• CQA-services have many answered questions

• CQA-retrieval aims to find answered questions similar to the question a user posts

• However, not all questions may be readily reused:

• Who designed the Eiffel Tower? Alexandre Gustave Eiffel.

• Who is the prime minister of the UK? Now: David Cameron. Before: Gordon Brown.

Page 6: Sustainable Questions

Problem statement

• Some questions are sustainable and can readily be reused, others are not

• A question is sustainable if the answer to that question is independent of the point in time at which the question is asked

• So, if the answer to semantically similar questions does not change over time, the questions are considered sustainable

Page 7: Sustainable Questions

Research questions

RQ1: What are the distinguishing properties of sustainable questions?

RQ2: Can we measure these properties of sustainability?

RQ3: Can we tell sustainable and non-sustainable questions apart based on these properties?

Page 8: Sustainable Questions

Approach: What makes a question sustainable?

1. Cluster semantically similar questions

2. Compare answers in each cluster

3. Classify clusters as sustainable


Page 9: Sustainable Questions

Cluster semantically similar questions

• Questions are semantically similar if they would be satisfied by the same information when asked at the same time

• However, questions tend to be

• very short

• phrased in different ways

• noisy

• littered with function words

Page 10: Sustainable Questions

Cluster semantically similar questions

• Latent Semantic Analysis (LSA; Deerwester et al., 1990) or Latent Dirichlet Allocation (LDA; Blei et al., 2003)

• topic modeling techniques

• cosine distance between topic vectors

• Locality Sensitive Hashing (LSH; Charikar, 2002)

• Used for near-duplicate detection

• Intuition: near-duplicates are very likely to be similar
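To make the LSA route concrete, here is a minimal Python sketch; the thesis does not publish its clustering code, so the toy corpus, the number of components and the 0.9 cosine threshold are illustrative assumptions. Questions are embedded in a latent topic space, and a question joins an existing cluster when its topic vector is close enough to the cluster's first member.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

questions = [
    "who designed the eiffel tower",
    "who built the eiffel tower",
    "who is the prime minister of the uk",
]

# Bag-of-words tf-idf representation of the questions.
tfidf = TfidfVectorizer().fit_transform(questions)

# LSA: reduce the sparse tf-idf space to a dense latent topic space.
# Only 2 components here because the toy corpus is tiny.
lsa = TruncatedSVD(n_components=2, random_state=0)
topic_vectors = lsa.fit_transform(tfidf)

# Greedy clustering: a question joins the first cluster whose
# representative it is sufficiently similar to (cosine >= 0.9).
clusters = []  # list of lists of question indices
for i, vec in enumerate(topic_vectors):
    for cluster in clusters:
        rep = topic_vectors[cluster[0]].reshape(1, -1)
        if cosine_similarity(vec.reshape(1, -1), rep)[0, 0] >= 0.9:
            cluster.append(i)
            break
    else:
        clusters.append([i])

print(clusters)  # e.g. [[0, 1], [2]]
```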

Page 11: Sustainable Questions

Cluster semantically similar questions

• Manually labeled set of 559 question pairs

• Calculate accuracy on samples of Yahoo! Answers Comprehensive Questions and Answers version 1.0

algorithm        10K     100K    all

LDA              0.435   0.500   -
LSA              0.706   0.638   -
LSH (16 bits)    0.472   0.484   0.500
LSH (24 bits)    0.465   0.502   0.495
LSH (32 bits)    0.512   0.514   0.509
LSH (40 bits)    0.523   0.537   0.542

Table 2: Accuracy of several question clustering methods on the 10K and 100K samples and the full set. Missing values represent experiments that never terminated.

In order to overcome vocabulary mismatch (different words with the same meaning are used in either the question or the answer) and different ways of spelling, and to improve overall matching on a semantic level, we use the semantic linking system of Meij et al. [24], developed to determine concepts in tweets.

This system approaches a similar problem: finding out what short pieces of text are about. In its operationalisation, pages on Wikipedia are considered as concepts. Subsequently, a model is trained to estimate the probability that a concept c is the target of a hyperlink (in Wikipedia) with an anchor text containing an n-gram q. Given a question or an answer, we obtain the set of concepts that are likely to be linked to by the occurrence of its n-grams in Wikipedia anchor texts, as well as a score (the sum of the probabilities of all n-grams in the piece of text linking to that specific concept).

Using these concepts as document vectors, with their scores as the values in those vectors, rather than tf-idf on the bag-of-words representation of questions and answers, we hope to obtain a less noisy vector space (change between answers is then based solely on concepts present in the text) and to diminish the influence of spelling and vocabulary mismatch. We will refer to documents processed this way as ‘semanticized’.
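A minimal sketch of this semanticizing step, assuming a precomputed n-gram-to-concept probability table; the `link_probability` dictionary below is a hypothetical stand-in for the trained linker of Meij et al., whose probabilities come from Wikipedia anchor-text statistics.

```python
from collections import defaultdict

# Hypothetical stand-in for the trained linker: probability that
# an n-gram is the anchor text of a link to a Wikipedia concept.
link_probability = {
    ("eiffel", "tower"): {"Eiffel_Tower": 0.93},
    ("gustave", "eiffel"): {"Gustave_Eiffel": 0.88},
    ("tower",): {"Eiffel_Tower": 0.12, "Tower_(building)": 0.30},
}

def semanticize(tokens, max_n=2):
    """Map a token list to a concept vector: concept -> summed
    link probability of all n-grams linking to that concept."""
    vector = defaultdict(float)
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngram = tuple(tokens[i:i + n])
            for concept, p in link_probability.get(ngram, {}).items():
                vector[concept] += p
    return dict(vector)

print(semanticize("who designed the eiffel tower".split()))
# {'Eiffel_Tower': 1.05, 'Tower_(building)': 0.3}
```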

We view the clustering of questions as a preprocessing step and therefore take it as part of the experimental setup. We explore three approaches to finding similar questions: latent semantic analysis [8], latent Dirichlet allocation [4] and locality sensitive hashing [6].

From the output of each clustering method on the 10K dataset, we sampled 559 pairs of questions and manually labeled 205 as correctly clustered together and 354 as wrongly clustered together. We used the combined set of labels (randomly sampling 205 questions from the wrongly-clustered set) to arrive at the results in Table 2; for each labeled pair of questions we observe whether the algorithm was correct in either putting both questions in the same cluster or keeping them separate.
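The pairwise evaluation itself is simple to state in code; a minimal sketch, where names such as `pairwise_accuracy` and `cluster_of` are illustrative rather than from the thesis:

```python
def pairwise_accuracy(labeled_pairs, cluster_of):
    """labeled_pairs: list of (question_a, question_b, same), where
    `same` is the human judgement that the pair belongs together.
    cluster_of: dict mapping a question id to its cluster id.
    A pair counts as correct when the algorithm's decision (same
    cluster or not) matches the human label."""
    correct = sum(
        (cluster_of[a] == cluster_of[b]) == same
        for a, b, same in labeled_pairs
    )
    return correct / len(labeled_pairs)

# Toy usage: two pairs, one judged similar and one not.
pairs = [("q1", "q2", True), ("q1", "q3", False)]
clusters = {"q1": 0, "q2": 0, "q3": 0}
print(pairwise_accuracy(pairs, clusters))  # 0.5
```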

Based on these accuracy results we decided on using LSA as our clustering approach for the remainder of our experiments. We also decided on taking the sample of 10K documents as the basis for our analysis.

While we consider clustering of similar questions a preprocessing step for our approaches to sustainability, we cannot ignore the fact that obtaining a reasonable clustering performance is important for our sustainability estimation. Therefore, as our clustering methods performed rather poorly, we opt to manually label data for further investigation.

We manually divided the 904 clusters in the output of our LSA clustering approach on the aforementioned subset of 10K questions into three classes: 752 all clusters, 143 clusters with similar questions and 7 clusters with sustainable questions.

[Plot omitted: density vs. average cosine distance for the all, similar, and sustainable classes.]

Figure 5: Kernel density estimation of the average cosine distance (i.e., change rate) between answers labeled as best according to either the user or the community.

The clusters in the similar class are only required to have similar questions, i.e., questions asking for the same information, regardless of the answers; these clusters can thus be either sustainable or unsustainable. Additionally, the clusters in the sustainable class are required to have answers that do not change over time. Note that this definition implies that the sustainable class is a subset of the similar class, which is a subset of the all class.

Subsequently, for each cluster we compute the cosine distances between chronologically sorted best answers, as described in section 3.2.1. For each set of distances (per cluster), we compute the average, standard deviation, average change per day, and standard deviation of the cumulative distances, as well as the slope and sum of squared errors of a linearly fitted function on the cumulative distances.

Also, for each cluster we compute the time between the moment a question was posted and the answer labeled as best answer, and the time until the last answer that question received, as described in section 3.2.2. For each set of these distances in time, we compute the average, standard deviation, and standard deviation of the cumulative distances in time, as well as the slope and sum of squared errors of a linearly fitted function on the cumulative distances in days.
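A minimal sketch of these change-rate features, assuming the per-cluster cosine distances are already computed; the function name and feature keys are illustrative, not from the thesis.

```python
import numpy as np

def change_features(distances, days=None):
    """Slope and sum of squared errors of a linear fit on the
    cumulative cosine distances between consecutive best answers.
    `days` (optional) places each step on a time axis instead of
    an answer index, mirroring the change-over-time variant."""
    cumulative = np.cumsum(distances)
    x = np.asarray(days, dtype=float) if days is not None \
        else np.arange(len(cumulative), dtype=float)
    slope, intercept = np.polyfit(x, cumulative, 1)
    sse = float(np.sum((cumulative - (slope * x + intercept)) ** 2))
    return {
        "avg": float(np.mean(distances)),
        "std": float(np.std(distances)),
        "cum_std": float(np.std(cumulative)),
        "slope": slope,
        "sse": sse,
    }

# Toy usage: four consecutive answer-to-answer distances.
print(change_features([0.2, 0.9, 0.1, 0.8], days=[0, 3, 40, 41]))
```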

4.2 Results

Figure 5 shows a kernel density estimation plot [9] of the average cosine distance between the best answers for each class of clusters. Although there seems to be some evidence that this metric distinguishes similar and sustainable clusters from regular clusters, the evidence is not that strong.

Figure 6 also shows a kernel density estimation plot, for the average cosine distance between the semanticized vector representations of the best answers for each class of clusters. As the remarkably similar plots for the distances between tf-idf vectors and semanticized vector representations in the single-cluster example in section 3.2.1 already suggested, semanticizing answers does not seem to be an improvement on the traditional tf-idf bag-of-words representation.

Also, considering time to answer as a distinctive property of sustainable questions does not yield decisive results.

[9] We use kernel density estimation because it models the density of data points at a value. In this way, a fairer comparison can be made between the instances of our three classes; we have far fewer sustainable than similar questions [37].
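As an illustration of this kernel density estimation, a sketch using scipy; the per-class values below are made up, whereas the real input is one average cosine distance per labeled cluster.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical per-class feature values (average cosine distance
# per cluster); the real values come from the labeled clusters.
classes = {
    "all":         np.array([0.9, 0.8, 1.0, 0.7, 0.95]),
    "similar":     np.array([0.6, 0.5, 0.7, 0.4]),
    "sustainable": np.array([0.1, 0.2, 0.15]),
}

grid = np.linspace(-0.5, 1.5, 200)
for name, values in classes.items():
    density = gaussian_kde(values)(grid)
    # Each density integrates to 1, so class sizes do not matter,
    # which is why KDE gives a fairer comparison than raw counts.
    print(name, density.max().round(2))
```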


Page 12: Sustainable Questions

Compare answers in each cluster

• Answers to similar questions that do not change over time indicate sustainable questions

• Output of LSA contained 904 clusters:

• 9 clusters considered sustainable

• 143 clusters considered similar

• 756 clusters considered all

• Compute properties of question-answer pairs (change, time, number of answers, etc.)

Page 13: Sustainable Questions

Compare answers in each cluster

[Plot omitted: cumulative cosine distance over time (Jan 2006 to Jun 2006) with a linear fitted line.]

Figure 4: As in Figure 3, the cumulative cosine distance between semanticized representations of answers with a linear fitted line for a single cluster. However, here the timing of the answers is taken into account.

The intuition is that sustainable questions are more likely to solicit answers longer after they were posted than non-sustainable questions; many questions are answered straight away and disappear from the timeline quickly, whereas some questions keep getting attention, and are therefore not (yet) expired.

4. EXPERIMENTS

Our experiments are aimed at answering the following research questions. What are the distinguishing properties of sustainable questions? Can we measure these properties of sustainability? Can we tell sustainable and unsustainable questions apart based on these properties?

4.1 Experimental Setup

Yahoo! Answers is a community question answering website, where users can ask and answer questions. Users are encouraged to answer questions by being rewarded points, with accompanying ranks and earnable badges.

4.1.1 Data

All our experiments are run on the Yahoo! Answers Comprehensive Questions and Answers version 1.0 [6] dataset. This data set consists of 4.5M questions, often with multiple answers, of which we used 3.2M. [7]

We have sampled two sets from the training set in order to develop, test and obtain clusters of similar questions; given the available resources, we were not able to run all clustering methods (discussed in section 3.1) on the complete training data set. Table 1 shows some descriptive statistics of the two subsets and the complete set, indicating that on a superficial level the distributions of questions do not differ much. However, we do note that the number of different languages grows with the size of the data set.

Also, we can see that questions and answers tend to be short. Although the information need is conveyed with a richer representation in natural language, we must also consider that our ‘documents’ are far sparser than would be the case in a traditional retrieval setting, which can negatively influence our efforts to estimate similarity between questions.

[6] http://webscope.sandbox.yahoo.com/catalog.php?datatype=l

[7] For validation purposes, we sorted the set by date, then split it (80% training set, 20% test set), and held 10% of the training set back as a dev-test set. However, due to time constraints we never used the held-back data.

Statistic                                 10K     100K    all

Number of questions                       10K     100K    3.2M
Average number of answers/question        7.1     7.1     7.1
Std. dev. number of answers/question      7.4     7.2     8.1
Average number of characters/question     175.0   176.7   177.3
Std. dev. of characters/question          204.2   200.0   201.7
Median of characters/question             103     104     105
Average number of characters/answer       332.8   336.5   336.0
Std. dev. of characters/answer            507.6   503.7   499.6
Median of characters/answer               168     175     177
Average number of sentences/question      2.8     2.8     2.9
Std. dev. number of sentences/question    2.7     2.6     2.6
Median number of sentences/question       2       2       2
Average number of sentences/answer        3.9     3.9     3.9
Std. dev. number of sentences/answer      6.3     5.2     5.1
Median number of sentences/answer         2       2       2
Question languages                        6       12      28
Main categories                           163     176     179
Categories                                869     1744    2853
Sub categories                            677     1245    1539

Table 1: Descriptive statistics of the Yahoo! Answers data set for the 10K and 100K samples and the complete set. The average number of answers is per question; the average question and answer length is in characters (spaces included). Languages and categories are counts of unique occurrences.

Additionally, we extracted several attributes of each question, such as the date the question was posted, when it was resolved, when the last answer was solicited (and how much time passed between asking and answering), how many answers were given, what the best answer was (either chosen by the asker or voted for by the community), and to which category it was assigned.

4.1.2 Preprocessing

In order to perform the change rate measures described in section 3.2.1, we employ two strategies to model the answer space. First, for each cluster of questions in the output (clustering methods are discussed in section 3.1), we perform case and accent folding and simple tokenisation on both questions and answers. We then model the answers as a tf-idf vector space [23]. [8]

However, while this is a traditional, well-tested and well-described approach in information retrieval, the vector space tends to be very sparse. As Table 1 indicates, questions and answers tend to be very short. There are some elaborate answers in the corpus, skewing the average, but more than half of the questions and answers are represented by two sentences.

In addition, Yahoo! Answers community members express their questions and answers in natural language, creating an abundance of different spellings, synonyms and complexity. This results in an even sparser vector space, as different spellings of the same word (including typographical errors) result in separate features in the vector space, while the same semantic meaning is intended.

[8] We used the implementation from the scikit-learn package; http://scikit-learn.org/stable/index.html.
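A minimal sketch of this preprocessing pipeline with scikit-learn, which the footnote above says was used; the exact vectorizer parameters and the toy answers are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

# Best answers of one cluster, sorted chronologically.
best_answers = [
    "Tony Blair is the prime minister.",
    "Gordon Brown took over in 2007.",
    "David Cameron, since the 2010 election.",
]

# Case and accent folding are handled by the vectorizer itself;
# its default tokeniser is a stand-in for 'simple tokenisation'.
vectorizer = TfidfVectorizer(lowercase=True, strip_accents="unicode")
vectors = vectorizer.fit_transform(best_answers)

# Cosine distance between each pair of consecutive best answers.
steps = [
    float(cosine_distances(vectors[i], vectors[i + 1])[0, 0])
    for i in range(vectors.shape[0] - 1)
]
print(steps)  # high here, since the answers share almost no terms
```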

Page 14: Sustainable Questions

Compare answers in each cluster


[Plot omitted: density vs. average cosine distance for the all, similar, and sustainable classes.]

Figure 6: Kernel density estimation of the average cosine distance (i.e., change rate) between semanticized answers labeled as best according to either the user or the community.

Figure 7 shows a kernel density estimation of the average time in days between the posting of a question and that question being marked as resolved. Almost all questions are answered within days of posting, although similar and sustainable question clusters seem to contain more questions that take longer to be answered satisfactorily than regular clusters. However, the distinction is not that clear.

[Plot omitted: density vs. days between question posted and best answer for the all, similar, and sustainable classes.]

Figure 7: Kernel density estimation of the average time in days between the posting of a question and the question being marked as resolved.

When we consider the time between the moment of posting a question and the moment that question receives its final answer, we see that questions we deem sustainable keep receiving answers far longer than ‘regular’ or even similar questions. Figure 8 shows a kernel density estimation plot [10] of the time between the posting of a question and the reception of its last answer. It should be noted that the set of sustainable clusters is a subset of the set of similar clusters, and that the set of similar clusters is a subset of the set of all clusters. This explains the second local maximum in the ‘all clusters’ line.

4.3 Analysis

[Plot omitted: density vs. days between question posted and last answer for the all, similar, and sustainable classes.]

Figure 8: Kernel density estimation of the average time in days between the posting of a question and the last answer a question received.

[10] This is why the plot covers negative values for time as well.

When comparing a kernel density estimation of the average cosine distance between the best answers to the questions in a cluster (shown in Figure 5) with a kernel density estimation of the average time in days between posting a question and that question receiving its last answer (shown in Figure 8), we see that the time between the posting of a question and its last answer is very indicative of sustainability: the longer a question solicits answers, the higher the probability that the question is sustainable.

In addition, from the simple properties (average, standard deviation, slope, SSE; detailed in Section 4.1.2) of clusters, we constructed five feature sets, as listed in Table 3. These correspond to the approaches discussed in Section 3.2: change per question (i.e., the amount of change between sequential questions), change per question normalised for time, change over time for semanticized representations of questions, and the time between asking and answering of questions (both between asking and the labeling of the best answer, and between asking and the reception of the last answer). Also, we used a combination of the ‘change over time’ and ‘time to answer’ sets.

feature set                    accuracy

change per question            66.9%
change over time               86.0%
semanticized change over time  75.3%
time to answer                 89.3%
change/time combination        91.5%

Table 3: Accuracy of different feature sets. ‘Change’ feature sets typically contain the average, (cumulative) standard deviation, slope and SSE of change rates (detailed in Section 3.2.1). ‘Time to answer’ contains the time in days between asking and answering a question (detailed in Section 3.2.2). ‘Combination’ contains features from both ‘change over time’ and ‘time to answer’ (detailed in Section 4.1.2).

When training a simple tree classifier [11] using the properties in each property set as features, on re-sampled data to balance the classes, we find that the combination of both change and time features obtains a classification accuracy of 91.5% in stratified 10-fold cross-validation. This indicates that very simple properties, such as the time between a question and its last answer and the cosine distance between the answers over time, allow for a reasonable distinction between sustainable and non-sustainable questions.

[11] We use the WEKA [14] implementation of C4.5 by Quinlan [27].
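The thesis uses WEKA's C4.5 implementation; as an analogous sketch in Python, here is a CART decision tree with stratified 10-fold cross-validation from scikit-learn. The synthetic feature matrix below is made up for illustration, not the thesis data, and CART is a related but different tree learner than C4.5.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)

# Hypothetical, re-sampled (balanced) cluster features:
# [avg change, slope of cumulative change, days to last answer].
X = np.vstack([
    rng.normal([0.9, 0.8, 10], 0.2, size=(50, 3)),   # non-sustainable
    rng.normal([0.2, 0.1, 200], 0.2, size=(50, 3)),  # sustainable
])
y = np.array([0] * 50 + [1] * 50)

tree = DecisionTreeClassifier(random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(tree, X, y, cv=cv)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```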

Page 15: Sustainable Questions

Classify clusters as sustainable

• Construct feature sets (change, change over time, time to answer)

• Train a classifier* on re-sampled data

• Accuracy in stratified 10-fold cross-validation:


feature set                    accuracy

change per question            66.9%
change over time               86.0%
semanticized change over time  75.3%
time to answer                 89.3%
change/time combination        91.5%



*We use the WEKA (Hall et al., 2009) implementation of C4.5 by Quinlan (1993)

Page 16: Sustainable Questions

Conclusions

• Explored a new problem concerning sustainability and reusability of questions in a CQA setting

• Sustainability can be reasonably estimated by simple question properties, where time is most descriptive (RQ1)

• These properties can be obtained easily, also from the data of other CQA services (RQ2)

• Using a simple classifier, these properties can be used to distinguish sustainable from non-sustainable questions (RQ3)

Page 17: Sustainable Questions

Future work

• Scaling (the considered sample is 3% of the training set)

• Clustering:

• on answers (twice as long as questions)

• both (where do clusters of answers and questions ‘agree’?)

• retrieval approach

• Evaluation; does factoring in sustainability have a positive effect on precision?

Page 18: Sustainable Questions

Questions?

Page 19: Sustainable Questions

References

• D.M. Blei, A.Y. Ng, and M.I. Jordan. Latent Dirichlet Allocation. The Journal of Machine Learning Research, 3:993–1022, March 2003.

• M. Charikar. Similarity estimation techniques from rounding algorithms. In Proceedings of the thirty-fourth annual ACM Symposium on Theory of Computing, pages 380–388. ACM, 2002.

• S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.

• M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10–18, 2009.

• J. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

• C. Shah, S. Oh, and J. Oh. Research agenda for social Q&A. Library & Information Science Research, 31(4):205–209, 2009.

Page 20: Sustainable Questions

Data descriptives


Page 21: Sustainable Questions

Cluster properties


Page 22: Sustainable Questions

Cluster properties


Page 23: Sustainable Questions

Cluster properties


Page 24: Sustainable Questions

Cluster properties

We sort the list of hashes generated from our question input and then pass over the list, computing the Hamming distance [15] between consecutive items. The first hash is initialised as a cluster. Then we proceed to the second hash: if the current hash differs from the previous hash in three or fewer positions, it is appended to that cluster; otherwise a new cluster is initialised. In doing so, clustering in linear time can be achieved, although at the loss of some precision. Successful applications to detecting near-duplicate pages in web crawling [22] and to first story detection on Twitter [26] have been reported.
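A minimal sketch of this single-pass clustering, assuming the per-question hash values (e.g., simhashes in the sense of Charikar's random-hyperplane scheme) are already computed:

```python
def cluster_by_hamming(hashes, max_distance=3):
    """Single pass over sorted hashes: a hash joins the previous
    hash's cluster when they differ in at most `max_distance` bit
    positions; otherwise it starts a new cluster. Linear time
    after sorting, at the cost of some precision."""
    clusters = []
    previous = None
    for h in sorted(hashes):
        # Hamming distance: number of set bits in the XOR.
        if previous is not None and bin(h ^ previous).count("1") <= max_distance:
            clusters[-1].append(h)
        else:
            clusters.append([h])
        previous = h
    return clusters

# Toy usage with 8-bit hashes: two of them differ in a single bit
# and end up in one cluster after sorting makes them adjacent.
print(cluster_by_hamming([0b10110100, 0b10110101, 0b01001011]))
# [[75], [180, 181]]
```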

3.2 Measuring Sustainability

The operationalisation of our definition of sustainable questions, as described in section 2.4, implies that we first have to identify similar questions; in order to estimate to what extent the answers to very similar or the same questions change over time, we need to find sets of such similar questions.

3.2.1 Change rate of answers

For each cluster of questions we create a tf-idf vector space of the answers labeled as ‘best answer’ by either the question asker or the community. Subsequently, we fit a linear function on the cumulative cosine distances between the answers (as shown in Figure 1), as well as on the cumulative cosine distances between answers over time (shown in Figure 2). Figure 1 shows a rather constant change in answers over time, whereas Figure 2 displays differences in the speed of change between different answers over time, suggesting that time might be an important factor in determining the evolution of answers to questions. Figure 1 and Figure 2 show the same cluster.

[Plot omitted: cumulative cosine distance vs. answer index with a linear fitted line.]

Figure 1: Cumulative cosine distance between vector representations of answers with a linear fitted line for a single cluster. For the 9 best answers in this cluster, the theoretical maximum of the cumulative distance is 8.

The idea behind this approach is that the slope of this linear function will provide an indication of how fast the answers to a set of similar questions change. Also, the sum of squared errors of this function given the data might provide clues to periodicity; if the answers to similar questions exhibit large amounts of change in short periods of time, that might indicate that the subject of these questions is subject to periodic change (for example, the answer to ‘who is the world champion soccer’ is expected to change suddenly at periodic time intervals). Additionally, we compute the standard deviation of the set of distances.

We also represented our set of answers as semanticized vectors. This approach is described in more depth in section 4.1.2.

[Plot omitted: cumulative cosine distance over time (Jan 2006 to Jun 2006) with a linear fitted line.]

Figure 2: As in Figure 1, the cumulative cosine distance between vector representations of answers with a linear fitted line for a single cluster. However, here the timing of the answers is taken into account.

[Plot omitted: cumulative cosine distance vs. answer index with a linear fitted line.]

Figure 3: Cumulative cosine distance between semanticized representations of answers with a linear fitted line for a single cluster. For the 9 best answers in this cluster, the theoretical maximum of the cumulative distance is 8.

Figure 3 and Figure 4 display the results of the same approach applied to the tf-idf vector representations of the semanticized answers. Both figures represent the same cluster of questions shown in Figures 1 and 2, and show remarkably similar graphs for both the change and the change-over-time approach; only small differences, such as at the second question, can be observed.

3.2.2 Speed of response

Another property that might be indicative of the sustainability of a question is the time it takes for a question to be answered. The intuition here is that a sustainable question has a higher probability of soliciting answers over a longer period of time, as the question would still be relevant.

For each cluster, we computed the average time in days for a question to be resolved (i.e., the time between the posting of a question and the posting of the best answer). Also, we computed the standard deviation of the answering time, as well as the total number of days the questions in a cluster had to ‘wait’ for their best answer.

In addition, we computed the average time in days between the posting of a question and the last answer it received.
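A minimal sketch of these speed-of-response features, assuming per-question dates are available; the function and field names are illustrative, not from the thesis.

```python
from datetime import date
from statistics import mean, pstdev

def response_features(questions):
    """questions: list of (posted, best_answer, last_answer) dates
    for one cluster. Returns the per-cluster timing properties."""
    to_best = [(b - p).days for p, b, _ in questions]
    to_last = [(l - p).days for p, _, l in questions]
    return {
        "avg_days_to_best": mean(to_best),
        "std_days_to_best": pstdev(to_best),
        "total_days_to_best": sum(to_best),
        "avg_days_to_last": mean(to_last),
    }

# Toy usage: a cluster with two questions.
cluster = [
    (date(2006, 1, 3), date(2006, 1, 4), date(2006, 5, 20)),
    (date(2006, 2, 1), date(2006, 2, 1), date(2006, 2, 3)),
]
print(response_features(cluster))
```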

Page 25: Sustainable Questions

Cluster properties
