R Package Recommendation
7/28/2019 R Package Recommendation
R Package Recommendation
Yingying Xu
Abstract: Newcomers to the R programming language often face the
problem of choosing packages that are relevant and of high
quality. An automated R package recommendation system
would be useful in showing which packages typical users have
installed, together with measurable qualities of those packages. When
building a recommendation system, we are interested in the
probability of a user installing a package, so in this paper we explore
various features that might influence the user's decision and
experiment with several statistical learning methods.
1. Introduction
Each programming language has a large number of
libraries/packages that extend the functionality of the core
language. Fluent use of a number of libraries is as important
for every programmer as mastery of the basic syntax.
Newcomers to a programming language often face the
problem of choosing libraries that are relevant and of
high quality. Inspecting each library manually by reading
its functionality description can be a daunting task, and the
description provides little information on the quality of the library. Thus, an
automated package recommendation system would be useful in
telling the programmer what packages other people have
installed, and the measurable properties that explain this result.
R is a language and software environment for statistical computing and
graphics. In this paper, we try several methods for building
an R package recommendation engine. CRAN, a popular site for R,
is a network of FTP and web servers around the world
that stores code and documentation for R. On CRAN, each R
package summary contains a short description of what the
package does, the authors and maintainer of the package, the
imports and dependencies, and other packages it suggests. The
rich meta-data each R package contains enables us to explore
many potential relationships and make fairly accurate
predictions on whether a user will install a package.
1. Data Description
1.1 Network Graph
In the R package network there are several entities: package,
topic/task view, author, maintainer, and user. Task views
enable users to browse packages by topic. There are currently
29 task views available on CRAN. The links between
packages are imports, depends, and suggests. A package is written
by several authors, is maintained by one maintainer (usually
one of the authors), belongs to one or several topics/task
views, and is installed by users. A maintainer can maintain
one or several packages at the same time. We also know
whether a package is a recommended package on CRAN, and
whether it is a core package.
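To make the link structure concrete, here is a minimal Python sketch that counts in-links per package from a toy edge list. The package names and edges are invented for illustration and are not taken from CRAN. The function name `in_link_counts` is hypothetical, and the relation labels simply mirror the three link types named above.

```python
from collections import Counter

# Hypothetical edges: (source package, relation, target package).
# ("pkgA", "depends", "pkgC") means pkgA depends on pkgC.
edges = [
    ("pkgA", "depends", "pkgC"),
    ("pkgB", "depends", "pkgC"),
    ("pkgA", "imports", "pkgD"),
    ("pkgB", "suggests", "pkgD"),
    ("pkgC", "suggests", "pkgD"),
]


def in_link_counts(edges, relation):
    """Count, for each package, how many other packages point to it via `relation`."""
    return Counter(target for _, rel, target in edges if rel == relation)


dependency_count = in_link_counts(edges, "depends")
suggestion_count = in_link_counts(edges, "suggests")
import_count = in_link_counts(edges, "imports")
```

A `Counter` returns 0 for packages with no in-links of the given type, so a package that nothing depends on needs no special handling.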
1.2 Training data
The training data are taken from the Dataists R
recommendation system contest
(https://github.com/johnmyleswhite/r_recommendation_system). The graph data were obtained by crawling the entire CRAN
website, and the predictors were derived from the rich meta-data.
The training data contain 99,640 rows describing
installation information for 1865 packages for 52 users of R. It is
a matrix in which each row provides the following information:
Package: The name of the current R package.
User: The numeric ID of the current user who may or may not
have installed the current package.
Installed: A dummy variable indicating whether the current
package is installed by the current user.
DependencyCount: The number of other R packages that depend
upon the current package.
SuggestionCount: The number of other R packages that suggest
the current package.
ImportCount: The number of other R packages that import the
current package.
ViewsIncluding: The number of task views on CRAN that
include the current package.
CorePackage: A dummy variable indicating whether the current
package is part of core R.
RecommendedPackage: A dummy variable indicating whether
the current package is a recommended R package.
Maintainer: The name and e-mail address of the package's
maintainer.
PackagesMaintaining: The number of other R packages that are
being maintained by the current package's maintainer.
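As an illustration of the row layout described above, the following Python sketch parses a few rows with these column names. The sample values, the maintainer names, and the use of `NA` to mark an unknown `Installed` value are assumptions made for the example; `parse_row` and `install_rate` are hypothetical helper names.

```python
import csv
import io

# Made-up sample rows; only the column names come from the description above.
SAMPLE_CSV = """Package,User,Installed,DependencyCount,SuggestionCount,ImportCount,ViewsIncluding,CorePackage,RecommendedPackage,Maintainer,PackagesMaintaining
pkgA,1,1,2,0,1,0,0,0,Jane Doe <jane@example.org>,3
pkgA,2,0,2,0,1,0,0,0,Jane Doe <jane@example.org>,3
pkgB,1,NA,0,1,0,1,0,0,John Roe <john@example.org>,1
"""

INT_FIELDS = [
    "User", "DependencyCount", "SuggestionCount", "ImportCount",
    "ViewsIncluding", "CorePackage", "RecommendedPackage", "PackagesMaintaining",
]


def parse_row(raw):
    """Convert numeric columns to int; an 'NA' label (assumed marker) becomes None."""
    row = dict(raw)
    for name in INT_FIELDS:
        row[name] = int(raw[name])
    row["Installed"] = None if raw["Installed"] == "NA" else int(raw["Installed"])
    return row


def install_rate(rows, package):
    """Fraction of users with a known label who installed `package`, or None."""
    known = [r["Installed"] for r in rows
             if r["Package"] == package and r["Installed"] is not None]
    return sum(known) / len(known) if known else None


rows = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
```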
Besides the given features, we can also use the open source
website crawler to crawl the entire CRAN website and derive
more informative links from the network graph, such as the
actual name of each package that depends on, imports, or
suggests the current package, and the names of the task views
in which the current package appears. This is discussed in the
following section.
2. Feature Selection
Several intuitions and assumptions underlie the prediction of
whether a user will install an R package in this paper. A
package is more likely to be installed by a user if
f. Let [definition garbled in the source] $= |\cdot|^3$. The predicted value for an input $x$ is [formula garbled in the source].
3.4 Naive Bayes
Naive Bayes assumes that, conditional on the class label, the
features are independent of each other, i.e.,
$$P(y \mid \mathbf{x}) = \frac{1}{Z} P(y)\, P(\mathbf{x} \mid y) \overset{\text{(independence assumption)}}{=} \frac{1}{Z} P(y) \prod_{i=1}^{p} P(x_i \mid y)$$
where $y$ is the class label, $\mathbf{x}$ is the feature vector, $x_i$ is each
feature, $p$ is the total number of features, and $Z$ is a normalizing
factor ensuring $P(y = 1 \mid \mathbf{x}) + P(y = -1 \mid \mathbf{x}) = 1$.
$P(y \mid \mathbf{x})$ is calculated for each sample point, and $y$ is assigned to
the class that has the higher probability.
Despite the naive assumption of this model, it achieves a good
AUC score in the experiment.
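The classification rule can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiments. It assumes binary (0/1) features and add-alpha smoothing, neither of which is stated in the paper, and the function names are hypothetical. The labels follow the paper's coding: -1 for installed, 1 for not installed.

```python
import math


def train_naive_bayes(X, y, alpha=1.0):
    """Estimate P(y) and P(x_i = 1 | y) for binary features with add-alpha smoothing."""
    classes = sorted(set(y))
    n_features = len(X[0])
    prior = {c: sum(1 for label in y if label == c) / len(y) for c in classes}
    cond = {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        cond[c] = [
            (sum(r[i] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for i in range(n_features)
        ]
    return prior, cond


def posterior(x, prior, cond):
    """Return P(y | x) for every class; dividing by z makes the values sum to 1."""
    unnormalized = {}
    for c in prior:
        log_p = math.log(prior[c])
        for i, value in enumerate(x):
            p_one = cond[c][i]
            log_p += math.log(p_one if value == 1 else 1.0 - p_one)
        unnormalized[c] = math.exp(log_p)
    z = sum(unnormalized.values())
    return {c: u / z for c, u in unnormalized.items()}


def predict(x, prior, cond):
    """Assign x to the class with the higher posterior probability."""
    post = posterior(x, prior, cond)
    return max(post, key=post.get)


# Toy data: the first feature separates the classes, the second is uninformative.
X = [[1, 0], [1, 1], [0, 0], [0, 1]]
y = [-1, -1, 1, 1]
prior, cond = train_naive_bayes(X, y)
```

Working in log space before normalizing avoids numerical underflow when the number of features is large.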
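3.5 SVM
SVMs with linear, quadratic, and Gaussian kernels are
tried, and the quadratic kernel returns the best AUC.
Linear kernel:
$$k(\mathbf{x}, \mathbf{x}') = \langle \mathbf{x}, \mathbf{x}' \rangle$$
Gaussian radial basis function kernel:
$$k(\mathbf{x}, \mathbf{x}') = \exp\left(-\sigma \|\mathbf{x} - \mathbf{x}'\|^2\right)$$
Rational quadratic kernel:
$$k(\mathbf{x}, \mathbf{x}') = 1 - \frac{\|\mathbf{x} - \mathbf{x}'\|^2}{\|\mathbf{x} - \mathbf{x}'\|^2 + c}$$
The rational quadratic kernel is less computationally intensive
than the Gaussian kernel.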
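The three kernels can be written as short Python functions. This sketch assumes the standard forms: the inner product for the linear kernel, exp(-sigma * ||x - z||^2) for the Gaussian kernel, and 1 - ||x - z||^2 / (||x - z||^2 + c) for the rational quadratic kernel. The parameter names `sigma` and `c` and their default values are illustrative assumptions.

```python
import math


def squared_distance(x, z):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, z))


def linear_kernel(x, z):
    """Inner product of x and z."""
    return sum(a * b for a, b in zip(x, z))


def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian radial basis function kernel."""
    return math.exp(-sigma * squared_distance(x, z))


def rational_quadratic_kernel(x, z, c=1.0):
    """Rational quadratic kernel; needs no exponential evaluation."""
    d2 = squared_distance(x, z)
    return 1.0 - d2 / (d2 + c)
```

The rational quadratic kernel uses only additions and one division per pair once the squared distance is known, which is consistent with its lower cost relative to the Gaussian kernel.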
4. Experiment
4.1 Data preprocessing
The training data contain a large number of rows that have
missing class labels (we don't know whether the user installed the
package). Since there are many more (about 10 times more) training
rows in class 1 (not installed) than in class -1 (installed), the
missing class labels are simply replaced by 1.
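The imputation step amounts to a one-line replacement. In the Python sketch below, a missing label is represented by `None`, which is an assumption about the encoding. The label list is made up, and `fill_missing_labels` is a hypothetical helper name.

```python
from collections import Counter


def fill_missing_labels(labels, fill_value=1):
    """Replace missing labels (None) with fill_value; 1 means not installed."""
    return [fill_value if label is None else label for label in labels]


labels = [-1, None, 1, None, 1]          # made-up labels; None marks a missing value
filled = fill_missing_labels(labels)
class_counts = Counter(filled)           # class sizes after imputation
```

Filling with the majority class increases the class imbalance further, which is one reason to evaluate with AUC rather than raw accuracy.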
4.2 Predictors
In logistic regression, the final predictors used in the
experiments are: Package, User, DependencyScore,
SuggestionScore, DependencyScore, ViewsIncluding,
CorePackage, RecommendedPackage, Maintainer, and
PackagesMaintaining. The meaning of each predictor is
explained in sections 1.2 and 2.
In KNN, the total number of sample points is used as K; the
numerical predictors are used when calculating the cosine
similarity between two packages.
For core packages, we predict the class label to
be -1 (indicating that the package will be installed by the user) for all
users.
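A rough Python sketch of this scheme follows. The exact weighting formula used in the paper is not recoverable here, so the code assumes a plain similarity-weighted average of all training labels (K equal to the number of samples). The feature vectors are invented, and all function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two numeric feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_score(query, train_features, train_labels):
    """Similarity-weighted average of all training labels (assumed weighting)."""
    weights = [cosine_similarity(query, f) for f in train_features]
    total = sum(abs(w) for w in weights)
    if total == 0.0:
        return 0.0
    return sum(w * y for w, y in zip(weights, train_labels)) / total


def predict_label(query, train_features, train_labels, is_core=False):
    """Return -1 (installed) or 1 (not installed); core packages are always -1."""
    if is_core:
        return -1
    # A score of exactly 0 falls back to the majority class, 1 (not installed).
    return -1 if knn_score(query, train_features, train_labels) < 0 else 1


train_features = [[1.0, 0.0], [0.0, 1.0]]   # made-up numeric predictors
train_labels = [-1, 1]
```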
4.3 Results
Five-fold cross-validation AUC is used as the criterion.
The results are summarized in Table 4.3.1:

Model                                            Average AUC in 5-fold cross-validation
Baseline model                                   0.9028
KNN with K = # observations                      0.9472
L1 regularized logistic regression + AdaBoost    0.9635
Naive Bayes                                      0.9019
SVM with quadratic kernel                        0.9455
Table 4.3.1
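For readers unfamiliar with the criterion, the Python sketch below shows one way to compute an average AUC over five folds. It makes several assumptions: the installed class (-1) is treated as the positive class, higher scores mean "more likely installed", folds are contiguous and unshuffled, and `fit` and `score` are caller-supplied callables. None of this is specified in the paper, and real data would normally be shuffled before splitting.

```python
def auc(scores, labels, positive=-1):
    """Probability that a random positive example is scored above a random negative one."""
    pos = [s for s, y in zip(scores, labels) if y == positive]
    neg = [s for s, y in zip(scores, labels) if y != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


def k_fold_indices(n, k=5):
    """Split range(n) into k contiguous folds whose sizes differ by at most one."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validated_auc(X, y, fit, score, k=5):
    """Average held-out AUC over k folds."""
    aucs = []
    for fold in k_fold_indices(len(y), k):
        held_out = set(fold)
        train = [i for i in range(len(y)) if i not in held_out]
        model = fit([X[i] for i in train], [y[i] for i in train])
        fold_scores = [score(model, X[i]) for i in fold]
        aucs.append(auc(fold_scores, [y[i] for i in fold]))
    return sum(aucs) / len(aucs)
```

The pairwise AUC computation is quadratic in the fold size. It is fine for a sketch, but a rank-based formula would be preferable for data of the size described in section 1.2.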
5. Discussion
Among the models above, L1 regularized logistic
regression with AdaBoost gives the highest AUC (0.9635). In this paper, only in-links,
such as being suggested by other packages, are considered. In
future studies, out-links, such as importing or
suggesting other packages, could also be used. Classification
might also be improved if we separately classify the packages in
each topic.
REFERENCES
[1] Alexandros Karatzoglou and David Meyer. Support Vector Machines in R.
Journal of Statistical Software, Volume 15, Issue 9, April 2006
[2] Greg Ridgeway. Generalized Boosted Models: A guide to the gbm package. August 3, 2007
[3] Jerome H. Friedman. Regularized Discriminant Analysis. Journal of the American Statistical Association, Vol. 84, No. 405 (Mar., 1989), pp. 165-175
[4] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of
Statistical Learning
[5] http://cran.r-project.org/web/packages/e1071/index.html
[6] http://www.cs.princeton.edu/~schapire/boost.html
[7] http://cran.r-project.org/web/packages/glmnet/index.html
[8] http://cran.r-project.org/
[9] Training data set and the web crawler open source code are downloaded from: https://github.com/johnmyleswhite/r_recommendation_system