Download Estimation for KDD Cup 2003

Janez Brank and Jure Leskovec
Jožef Stefan Institute
Ljubljana, Slovenia

Task Description

Inputs:
• Approx. 29000 papers from the “high energy physics – theory” area of arxiv.org
• For each paper:
    • Full text (TeX file, often very messy)
    • Metadata in a nice, structured file (authors, title, abstract, journal, subject classes)
• The citation graph (excludes citations pointing outside our dataset)

Task Description

Inputs (continued):
• For papers from 6 months (the training set, 1566 papers): the number of times this paper was downloaded during its first two months in the archive

Problem:
• For papers from 3 months (the test set, 678 papers), predict the number of downloads in their first two months in the archive
• Only the 50 most frequently downloaded papers from each month will be used for evaluation!

Our Approach

• Textual documents have traditionally been treated as “bags of words”
    • The number of occurrences of each word matters, but the order of the words is ignored
    • Efficiently represented by sparse vectors
• We extend this to include other items besides words (“bag of X”)
    • Most of our work was spent trying various features and adjusting their weight (more on that later)
• Use support vector regression to train a linear model, which is then used to predict the download counts on test papers
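As a rough illustration of the representation described above, the sketch below builds a sparse bag-of-words vector and applies a linear model to it. The function names, weights and bias are hypothetical and chosen by hand; in the slides the weights come from support vector regression, which is not reproduced here.

```python
from collections import Counter

def bag_of_words(text):
    """Map a text to {word: count}; the order of the words is discarded."""
    return Counter(text.lower().split())

def predict(weights, bias, features):
    """Linear model: bias plus the sum of weight * value over nonzero features."""
    return bias + sum(weights.get(name, 0.0) * value
                      for name, value in features.items())

# Sparse vector: only words that actually occur are stored.
x = bag_of_words("Brane new world new brane")

# Hypothetical weights and bias, NOT learned from the real data.
w = {"brane": 40.0, "world": 10.0}
estimate = predict(w, 300.0, x)
```

Because the vector is a dictionary, prediction touches only the features present in the paper, which is what makes tens of thousands of features manageable.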

A Few Initial Observations

• Our predictions will be evaluated on the 50 most downloaded papers from each month (about 20% of all papers from these months)
    • It’s OK to be horribly wrong on other papers
    • Thus we should be optimistic, treating every paper as if it was in the top 20%
• Maybe we should train the model using only the 20% most downloaded training papers
    • Actually, 30% usually works a little better
• To evaluate a model, we look at the 20% most downloaded test papers

Cross-Validation

Labeled papers (1566)
• Split into 10 folds
• Train: take 9 folds (approx. 1409 papers), keep the 30% most frequently downloaded (approx. 423 papers), and train the model on them
• Evaluate: take the remaining 1 fold (approx. 157 papers), keep the 20% most frequently downloaded (approx. 31 papers), and evaluate the model on them
• Lather, rinse, repeat (10 times)
• Report average
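The procedure above can be sketched as follows. This is a minimal illustration, not the authors' code: `train_fn` and `error_fn` are placeholders for whatever learner and error measure are used, and the mean predictor and toy data at the bottom are stand-ins invented for the example.

```python
import random

def top_fraction(papers, fraction):
    """Keep the given fraction of papers with the highest download counts."""
    ranked = sorted(papers, key=lambda p: p["downloads"], reverse=True)
    return ranked[:max(1, round(fraction * len(ranked)))]

def cross_validate(papers, train_fn, error_fn, n_folds=10, seed=0):
    """Train on the top 30% of 9 folds, evaluate on the top 20% of the held-out fold."""
    shuffled = list(papers)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    errors = []
    for i, held_out in enumerate(folds):
        rest = [p for j, fold in enumerate(folds) if j != i for p in fold]
        model = train_fn(top_fraction(rest, 0.30))
        errors.append(error_fn(model, top_fraction(held_out, 0.20)))
    return sum(errors) / len(errors)  # report the average over the folds

# Hypothetical stand-ins: a constant predictor and mean absolute error.
def train_mean(train):
    return sum(p["downloads"] for p in train) / len(train)

def mean_abs_error(model, test):
    return sum(abs(model - p["downloads"]) for p in test) / len(test)

papers = [{"downloads": 10 * i} for i in range(1, 101)]  # made-up data
score = cross_validate(papers, train_mean, mean_abs_error)
```

With 1566 labeled papers the same rounding gives the figures on the slide: 30% of about 1409 training papers is about 423, and 20% of about 157 held-out papers is about 31.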

A Few Initial Observations

• We are interested in the downloads within 60 days since inclusion in the archive
    • Most of the downloads occur within the first few days, perhaps a week
    • Most are probably coming from the “What’s new” page, which contains only:
        • Author names
        • Institution name (rarely)
        • Title
        • Abstract
• Citations probably don’t directly influence downloads in the first 60 days
    • But they show which papers are good, and the readers perhaps sense this in some other way from the authors / title / abstract

[Figure: average number of downloads on that day (y-axis, 0 to 60) versus day since the paper was added to the archive (x-axis, 0 to 60)]

The Rock Bottom

• The trivial model: always predict the average download count (computed on the training data)
    • Average download count: 384.2
    • Average error: 152.5 downloads
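A minimal sketch of the trivial model, on made-up numbers. The slides do not define “average error”; the code assumes it is the mean absolute difference between predicted and actual download counts.

```python
def mean(values):
    return sum(values) / len(values)

def average_error(predictions, actual):
    """Mean absolute difference (an ASSUMED definition of 'average error')."""
    return mean([abs(p - a) for p, a in zip(predictions, actual)])

train_downloads = [120, 300, 450, 610]  # made-up training counts
test_downloads = [200, 500]             # made-up test counts

# The trivial model ignores the paper and always predicts the training mean.
baseline = mean(train_downloads)
error = average_error([baseline] * len(test_downloads), test_downloads)
```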

Abstract

• Abstract: use the text of the abstract and title of the paper in the traditional bag-of-words style
    • 19912 features
    • No further feature selection etc.
    • This part of the vector was normalized to unit length (Euclidean norm = 1)
• Average error: 149.4
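The unit-length normalization can be illustrated as below. The helper name `unit_length` is hypothetical, and raw word counts are assumed as the input values since the slides do not mention any other term weighting.

```python
import math
from collections import Counter

def unit_length(vec):
    """Scale a sparse vector so that its Euclidean norm is 1 (empty stays empty)."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    if norm == 0.0:
        return dict(vec)
    return {key: value / norm for key, value in vec.items()}

# Title and abstract text would be concatenated here; this is a toy string.
counts = Counter("the holographic principle the".split())
block = unit_length(counts)
```

After normalization a long abstract and a short one contribute vectors of the same length, so only the relative word frequencies matter.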

Author

• One attribute for each possible author
    • Preprocessing to tidy up the original metadata: “Y.S. Myung and Gungwon Kang” becomes myung-y, kang-g
    • x_a is nonzero iff. a is one of the authors of the paper x
    • This part is normalized to unit length
    • 5716 features
• Average error: 146.4
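One way to obtain keys of the form shown above is sketched here. The exact tidying rules used by the authors are not given in the slides; the rule below (lowercased surname plus first initial, names separated by commas or “and”) is an assumption that merely reproduces the one example shown.

```python
import re

def author_key(name):
    """'Y.S. Myung' -> 'myung-y' (ASSUMED rule: surname + first initial)."""
    parts = name.replace(".", " ").split()
    return f"{parts[-1].lower()}-{parts[0][0].lower()}"

def author_keys(field):
    """Split an author field on commas and the word 'and', then tidy each name."""
    names = re.split(r",|\band\b", field)
    return [author_key(n) for n in names if n.strip()]

keys = author_keys("Y.S. Myung and Gungwon Kang")

# One nonzero attribute per author of the paper (before unit-length scaling).
author_block = {key: 1.0 for key in keys}
```

A rule this simple will merge distinct authors who share a surname and initial, which is a limitation of the sketch rather than something reported in the slides.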

Address

• Intuition: people are more likely to download a paper if the authors are from a reputable institution
    • Admittedly, the “What’s new” page usually doesn’t mention the institution
    • Nor is it provided in the metadata, we had to extract it from TeX files (messy!)
• Words from the address are represented using the bag-of-words model
    • But they get their own namespace, separate from the abstract and title words
    • This part of the vector is also normalized to unit length
• Average error: 154.0 (worse than useless)
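The separate namespace can be pictured as a prefix on each feature key, so that the same word gets distinct features depending on where it occurs. The `namespaced` helper and the toy vectors are hypothetical.

```python
def namespaced(namespace, vec):
    """Prefix every feature key with the name of the block it belongs to."""
    return {(namespace, key): value for key, value in vec.items()}

# Toy blocks: the word "institute" occurs in both places.
abstract = {"institute": 1.0, "brane": 2.0}
address = {"institute": 1.0, "ljubljana": 1.0}

x = {**namespaced("abstract", abstract), **namespaced("address", address)}
```

Here `x` has four features rather than three, so the learner can give “institute” in an address a different weight from “institute” in an abstract.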

Abstract, Author, Address

Average error:

Features              Training set   Test set
Author                    63.7         146.4
Abstract                  62.4         149.4
Address                   80.5         154.0
Author + Abstract         37.6         135.8
Author + Address          49.2         143.3
Abstract + Address        42.2         142.9
All three                 32.3         136.5

We used Author + Abstract (“AA” for short) as the baseline for adding new features

Using the Citation Graph

• InDegree, OutDegree
    • These are quite large in comparison to the text-based features (average indegree = approx. 10)
    • We must use weighting, otherwise they will appear too important to the learner
• InDegree is useful
• OutDegree is largely useless (which is reasonable)

[Figure: average error on the test set (y-axis, 127 to 137) versus weight of InDegree (x-axis, 0 to 0.01), for AA + InDegree; labelled value: 127.62]
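A sketch of the degree features with down-weighting. The citation list is made up, the 0.005 default for InDegree is one of the weights that appears later in the slides, and the zero default for OutDegree is only a placeholder reflecting that it was found largely useless.

```python
from collections import Counter

def degree_features(citations, weight_in=0.005, weight_out=0.0):
    """citations: list of (citing, cited) pairs restricted to the dataset."""
    indegree = Counter(cited for _, cited in citations)
    outdegree = Counter(citing for citing, _ in citations)
    papers = set(indegree) | set(outdegree)
    # Raw degrees are far larger than unit-length text blocks, so scale them down.
    return {p: {"InDegree": weight_in * indegree[p],
                "OutDegree": weight_out * outdegree[p]}
            for p in papers}

citations = [("b", "a"), ("c", "a"), ("c", "b")]  # made-up graph
features = degree_features(citations)
```

With an average indegree of about 10, a weight of 0.005 brings the feature to roughly 0.05, which is small next to a text block of norm 1.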

Using the Citation Graph

• InLinks = add one feature for each paper i; it will be nonzero in vector x iff. the paper x is referenced by the paper i
    • Normalize this part of the vector to unit length
• OutLinks = the same, nonzero iff. x references i

(results on next slide)
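The InLinks and OutLinks blocks could be built along these lines. The helper names and the toy citation list are hypothetical; the unit-length scaling follows the slide.

```python
import math

def unit(vec):
    """Scale a sparse vector to Euclidean norm 1 (an empty vector stays empty)."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / norm for k, v in vec.items()} if norm else {}

def link_features(citations, paper):
    """Return (InLinks, OutLinks) blocks for one paper from (citing, cited) pairs."""
    inlinks = {citing: 1.0 for citing, cited in citations if cited == paper}
    outlinks = {cited: 1.0 for citing, cited in citations if citing == paper}
    return unit(inlinks), unit(outlinks)

citations = [("b", "a"), ("c", "a"), ("a", "d")]  # made-up graph
in_block, out_block = link_features(citations, "a")
```

Unlike InDegree, these blocks record which papers are linked, so two papers cited by the same set of later papers end up with similar vectors.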

InDegree, InLinks, OutLinks

Average error:

Features                                          Training set   Test set
AA                                                    37.62        135.81
AA + InLinks                                          30.19        131.93
AA + OutLinks                                         30.27        132.47
AA + InLinks + OutLinks                               26.77        131.11
AA + 0.8 InLinks + 0.9 OutLinks                       28.33        130.69
AA + 0.004 InDeg + 0.8 InLinks + 0.9 OutLinks         27.95        124.35
AA + 0.005 InDeg + 0.5 InLinks + 0.7 OutLinks         30.54        123.73

Using the Citation Graph

• Use HITS to compute a hub value and an authority value for each paper (two new features)
• Compute PageRank and add this as a new feature
    • Bad: all links point backwards in time (unlike on the web), so PageRank accumulates in the earlier years
• InDegree, Authority, and PageRank are strongly correlated, no improvement over previous results
• Hub is strongly correlated with OutDegree, and is just as useless
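For reference, hub and authority values can be computed with the standard HITS power iteration, sketched below on a made-up citation list. The iteration count and the Euclidean normalization are choices made for this example, not details taken from the slides.

```python
import math

def _normalize(scores):
    """Scale scores to Euclidean norm 1 (left unchanged if all are zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    return {k: v / norm for k, v in scores.items()} if norm else scores

def hits(citations, iterations=50):
    """citations: list of (citing, cited) pairs. Returns (hub, authority) dicts."""
    papers = {p for edge in citations for p in edge}
    hub = {p: 1.0 for p in papers}
    authority = {p: 1.0 for p in papers}
    for _ in range(iterations):
        # Authority: sum of the hub values of the papers that cite this one.
        authority = {p: 0.0 for p in papers}
        for citing, cited in citations:
            authority[cited] += hub[citing]
        authority = _normalize(authority)
        # Hub: sum of the authority values of the papers this one cites.
        new_hub = {p: 0.0 for p in papers}
        for citing, cited in citations:
            new_hub[citing] += authority[cited]
        hub = _normalize(new_hub)
    return hub, authority

citations = [("b", "a"), ("c", "a"), ("c", "b")]  # made-up graph
hub, authority = hits(citations)
```

In this toy graph the most cited paper gets the top authority value and the paper with the most references gets the top hub value, which matches the correlation with InDegree and OutDegree noted above.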

Journal

• The “Journal” field in the metadata indicates that the paper has been (or will be?) published in a journal
    • Present in about 77% of the papers
    • Already in standardized form, e.g. “Phys. Lett.” (never “Physics Letters”, “Phys. Letters”, etc.)
    • There are over 50 journals, but only 4 have more than 100 training papers
• Papers from some journals are downloaded more often than from others:
    • JHEP 248, J. Phys. 104, global average 194
• Introduce one binary feature for each journal (+ one for “missing”)
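The journal attribute amounts to a one-hot encoding with an extra slot for papers that have no journal field. In this sketch the `"<missing>"` label and the default weight are illustrative choices, not values taken from the metadata format.

```python
def journal_feature(journal, weight=0.3):
    """One nonzero feature per paper: its journal, or a special 'missing' slot."""
    name = journal.strip() if journal else ""
    key = ("journal", name if name else "<missing>")
    return {key: weight}

with_journal = journal_feature("Phys. Lett.")
without_journal = journal_feature(None)
```

Because the field is already standardized, no string matching is needed to map spelling variants onto the same feature.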

Journal

[Figure: average error on the test set (y-axis, 120 to 140) versus weight of the Journal attribute (x-axis, 0 to 1); two series: “AA + Journal” and “AA + 0.005 InDeg + 0.5 InLinks + 0.7 OutLinks + Journal”; labelled values: 134.95 and 121.16]

Miscellaneous Statistics

• TitleCc, TitleWc: number of characters/words in the title
• The most frequently downloaded papers have relatively short titles:
    • The holographic principle (2927 downloads)
    • Twenty Years of Debate with Stephen (1540)
    • Brane New World (1351)
    • A tentative theory of large distance physics (1351)
    • (De)Constructing Dimensions (1343)
    • Lectures on supergravity (1308)
    • A Short Survey of Noncommutative Geometry (1246)

Miscellaneous Statistics

• Average error: 119.561 for weight = 0.02
    • The model says that the number of downloads decreases by 0.96 for each additional letter in the title :-)
• TitleWc is useless

[Figure: average error on the test set (y-axis, 119.4 to 121.4) versus weight of TitleCc/5 (x-axis, 0 to 0.3)]

Miscellaneous Statistics

• AbstractCc, AbstractWc: number of characters/words in the abstract
    • Both useless
• Number of authors (useless)
• Year (actually Year – 2000)
    • Almost useless (reduces error from 119.56 to 119.28)
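The miscellaneous statistics are simple to compute; a sketch follows. Whether the original character counts included spaces is not stated, so counting every character is an assumption, and `NumAuthors` is a name made up for this example.

```python
def misc_statistics(title, abstract, authors, year):
    """Simple per-paper statistics (ASSUMPTION: character counts include spaces)."""
    return {
        "TitleCc": len(title),
        "TitleCc/5": len(title) / 5,   # the scaled variant used in the slides
        "TitleWc": len(title.split()),
        "AbstractCc": len(abstract),
        "AbstractWc": len(abstract.split()),
        "NumAuthors": len(authors),    # hypothetical feature name
        "Year": year - 2000,
    }

stats = misc_statistics("Brane New World", "We study branes.", ["a", "b"], 2002)
```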

Clustering

• Each paper was represented by a sparse vector (bag-of-words, using the abstract + title)
• Use 2-means to split into two clusters, then split each of them recursively
    • Stop splitting if one of the two clusters would have < 600 documents
    • We ended up with 18 clusters
    • Hard to say if they’re meaningful (ask a physicist?)
• Introduce one binary feature for each cluster (useless)
• Also a feature (ClusDlAvg) to contain the average no. of downloads over all the training documents from the same cluster
    • Reduces error from 119.59 to 119.30
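A compact sketch of the recursive 2-means splitting and of the ClusDlAvg feature. For brevity it works on small dense vectors with squared Euclidean distance and a deterministic initialization; the slides used sparse bag-of-words vectors and do not specify the distance measure or initialization, so those details are assumptions.

```python
def _dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def _centroid(points):
    return [sum(column) / len(points) for column in zip(*points)]

def two_means(points, iterations=20):
    """Plain 2-means; starts from the first point and the point farthest from it."""
    c0 = points[0]
    c1 = max(points, key=lambda p: _dist2(p, c0))
    for _ in range(iterations):
        left = [p for p in points if _dist2(p, c0) <= _dist2(p, c1)]
        right = [p for p in points if _dist2(p, c0) > _dist2(p, c1)]
        if not left or not right:
            break
        c0, c1 = _centroid(left), _centroid(right)
    return left, right

def bisect(points, min_size):
    """Split recursively; keep a cluster whole if a part would fall below min_size."""
    if len(points) < 2:
        return [points]
    left, right = two_means(points)
    if min(len(left), len(right)) < min_size:
        return [points]
    return bisect(left, min_size) + bisect(right, min_size)

def cluster_download_average(cluster_ids, downloads):
    """ClusDlAvg: mean download count of the training papers in each cluster."""
    totals, counts = {}, {}
    for cid, count in zip(cluster_ids, downloads):
        totals[cid] = totals.get(cid, 0) + count
        counts[cid] = counts.get(cid, 0) + 1
    return {cid: totals[cid] / counts[cid] for cid in totals}

points = [[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]]  # made-up data
clusters = bisect(points, min_size=3)
averages = cluster_download_average([0, 0, 1], [100, 300, 50])
```

The slides used a minimum size of 600 documents; `min_size=3` here only suits the toy data.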

Tweaking and Tuning

Figures after each model are the average error on the training set / on the test set.

• AA + 0.005 InDegree + 0.5 InLinks + 0.7 OutLinks + 0.3 Journal + 0.02 TitleCc/5 + 0.6 (Year – 2000) + 0.15 ClusDlAvg: 29.544 / 119.072
• The “C” parameter for SVM regression was fixed at 1 so far
• C = 0.7, AA + 0.006 InDegree + 0.7 InLinks + 0.85 OutLinks + 0.35 Journal + 0.03 TitleCc/5 + 0.3 ClusDlAvg: 31.805 / 118.944
    • This is the one we submitted
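To make the weighted sum concrete, the sketch below multiplies each feature block by its weight from the submitted configuration and merges the blocks into one sparse vector. The weight of 1.0 for the Author and Abstract blocks is an assumption (the slides write “AA” without a coefficient), and the toy blocks are made up.

```python
# Weights of the submitted model; Author/Abstract = 1.0 is ASSUMED.
SUBMITTED_WEIGHTS = {
    "Author": 1.0, "Abstract": 1.0,
    "InDegree": 0.006, "InLinks": 0.7, "OutLinks": 0.85,
    "Journal": 0.35, "TitleCc/5": 0.03, "ClusDlAvg": 0.3,
}

def combine(blocks, weights=SUBMITTED_WEIGHTS):
    """Scale each block by its weight; blocks without a weight are dropped."""
    combined = {}
    for name, vec in blocks.items():
        if name not in weights:
            continue
        for key, value in vec.items():
            combined[(name, key)] = weights[name] * value
    return combined

blocks = {  # made-up feature blocks for one paper
    "Author": {"myung-y": 1.0},
    "InDegree": {"value": 10.0},
    "Year": {"value": 3.0},  # not part of the submitted model
}
x = combine(blocks)
```

Tuning then consists of changing the numbers in the weight table and rerunning the cross-validation.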

A Look Back…

Average error on the training set and on the test set:

Model                      Training set   Test set
Trivial model                 150.1         152.5
Author + Abstract              37.6         135.8
+ InDegree                     36.0         127.6
+ InLinks + OutLinks           30.5         123.7
+ Journal                      29.9         121.2
+ TitleCc/5                    29.7         119.6
Best model found               31.8         118.9

Conclusions

• It’s a nasty dataset!
• The best model is still disappointingly inaccurate
    • …and not so much better than the trivial model
• Weighting the features is very important
• We tried several other features (not mentioned in this presentation) that were of no use
• Whatever you do, there’s still so much variance left
• SVM learns well enough here, but it can’t generalize well
    • It isn’t the trivial sort of overfitting that could be removed simply by decreasing the C parameter in SVM’s optimization problem
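For readers unfamiliar with the C parameter, the standard primal problem of linear ε-insensitive support vector regression is shown below. The slides do not state which formulation or which value of ε was used, so this is the textbook form rather than the authors' exact setup.

```latex
\min_{w,\,b,\,\xi,\,\xi^*}\quad \frac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{n}\bigl(\xi_i + \xi_i^*\bigr)
\quad\text{subject to}\quad
\begin{cases}
y_i - w^\top x_i - b \le \varepsilon + \xi_i,\\
w^\top x_i + b - y_i \le \varepsilon + \xi_i^*,\\
\xi_i,\ \xi_i^* \ge 0, \qquad i = 1,\dots,n.
\end{cases}
```

Here x_i is the feature vector of training paper i and y_i its download count. A smaller C puts relatively more weight on keeping ‖w‖ small and less on fitting the training counts, which is the usual remedy for plain overfitting.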

Further Work

• What is it that influences readers’ decisions to download a paper?
• We are mostly using things they can see directly: author, title, abstract
• But readers are also influenced by their background knowledge:
    • Is X currently a hot topic within this community? (Will reading this paper help me with my own research?)
    • Is Y a well-known author? How likely is the paper to be any good?
• It isn’t easy to catch these things, and there is a risk of overfitting