
An Ensemble-based Recommendation Engine for Scientific Data Transfers

William Agnew
Georgia Inst. of Tech.
30332 North Ave NW
Atlanta, GA
[email protected]

Michael Fischer
University of Wisconsin-Milwaukee
Milwaukee, WI
[email protected]

Ian Foster
University of Chicago
5801 S Ellis Ave
Chicago, IL
[email protected]

Kyle Chard
University of Chicago
5801 S Ellis Ave
Chicago, IL
[email protected]

ABSTRACT
Big data scientists face the challenge of locating valuable datasets across a network of distributed storage locations. We explore methods for recommending storage locations ("endpoints") for users based on a range of prediction models, including collaborative filtering and heuristics that consider available information such as user, institution, access history, endpoint ownership, and endpoint usage. We combine the strengths of these models by training a deep recurrent neural network on their predictions. Collectively we show, via analysis of historical usage from the Globus research data management service, that our approach can predict the next storage location accessed by users with 80.3% and 95.3% accuracy for top-1 and top-3 recommendations, respectively. Additionally, our heuristics can predict the endpoints that users will use in the future with over 75% precision and recall.

1. INTRODUCTION
As data volumes and network speeds increase, the task of determining where useful data are to be found becomes more complex. Services such as Globus [12] simplify the management of scientific data (for example, by streamlining sharing [8] and publication [7]). Still, an individual scientist may have access to hundreds or even thousands of storage systems. Which should they visit next?

Commercial web services, such as travel, e-commerce, and television streaming services, rely on user-specific recommendations to both enhance user experience and drive revenue streams [4]. These recommendations are possible because of the large amount of usage information captured by these services. Here we investigate the feasibility of providing similar, targeted data location recommendations to scientists, with the goals of improving both 1) user experience and 2) understanding of how large scientific data are used. We use the approximately 3.5 million transfer operations conducted via Globus between research storage systems over the past five years as the basis for this study. We envisage that such capabilities could be offered as an online recommendation engine that would allow users to quickly find the data they are looking for while also enabling them to explore other data of relevance (Figure 1).

We define and evaluate a collection of specialized endpoint prediction heuristics that consider unique features of large scientific data, Globus users, and storage endpoint information (e.g., institution, transfer frequency, and endpoint location) derived from historical Globus usage data. We measure the performance of these heuristics in terms of how well they predict 1) the two specific endpoints used in a user's next transaction and 2) the set of endpoints used by a user in the future. We show that we can predict the next endpoint correctly over 95% of the time and future endpoints with over 75% precision and recall. In addition, by analyzing the relative contributions of the different features used by the heuristics, we explore the contribution of each feature to our recommendations. We find a large and surprising difference between good recommendation strategies for source and destination endpoints.

The rest of this paper is as follows. In §2 we investigate historical Globus usage as the basis for developing endpoint recommendation heuristics. We then describe in §3 a baseline recommendation algorithm, using industry-standard collaborative filtering. In §4 we describe a series of endpoint recommendation heuristics developed using our observations of historical usage. In §5, we present the neural network used to combine our heuristics' recommendations. Next, in §6, we evaluate the performance of our approaches. Finally, we compare with related work in §7 and conclude in §8.

Figure 1: Mockup of recommendation interface.


Figure 2: Globus Network. Endpoints are represented as vertices. Edges represent transfers between endpoints. Larger endpoints have transferred more frequently with more distinct endpoints. Endpoints that transfer more frequently with each other are closer.

2. GLOBUS
Globus, a software-as-a-service provider of research management capabilities, supports data transfer, synchronization, and sharing [2]; data publication [7]; identity management, authentication, and authorization [21]; and profile and group management [6]. It provides a rich source of information from which we can understand scientific data access patterns. Thus, as the basis for developing recommendation heuristics, we first explore historical Globus usage to derive features that may be indicative of usage.

Over its six years of existence, Globus has been used to conduct almost 3.5 million transfers, totaling more than 180 PB and 2.5 billion files, among 23,000 unique endpoints. Thus, Globus usage can be represented as a network with 23,000 vertices and 3.5 million edges. Part of this network is illustrated in Figure 2.

The graph highlights the different usage patterns of Globus users and endpoints. There are distinct endpoint clusters, typically centered around a single large (more frequently used) endpoint. These clusters are clearly related in some way: for instance, they may be associated with a data source (e.g., the National Center for Atmospheric Research's Research Data Archive), a particular scientific group, or a particular instrument or resource (e.g., the Blue Waters supercomputer). Some clusters are completely independent of the rest of the network, while others are more tightly connected. We also see that a small number of endpoints have been involved in many transfers with many endpoints, while many endpoints have participated in few transfers with few endpoints.

Globus is used in a broad range of scenarios, including automated usage by scripts and third-party applications. Because we are primarily interested in providing recommendations to users, rather than to programs that use Globus (e.g., those that back up supercomputers), we focus our prediction efforts on the roughly 800,000 operations initiated using the web interface. Figure 3 further illustrates usage patterns, showing the data volumes, number of transfers, and number of unique endpoints per user for all Globus usage submitted via the web interface. These long-tailed distributions are a defining feature of Globus (and perhaps of scientific data usage in general), and present a challenge for endpoint prediction. Endpoints with low usage provide little historical information on which to base predictions, and endpoints with high activity are often used by many different scientists, each with different usage patterns.

Although Globus has been used to transfer billions of files, the mappings between users and endpoints, and between endpoints and endpoints, are sparse. Only ∼0.01% of potential user/endpoint pairs and 0.006% of potential endpoint/endpoint pairs are present. Figure 4 illustrates this sparsity with a heatmap showing the transactions between the 200 most active users and endpoints.
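Sparsity figures of this kind can be computed from a transfer log with a short script. The following is a minimal sketch assuming a hypothetical CSV export with user_id, src_endpoint, and dst_endpoint columns; the real Globus schema may differ.

```python
import pandas as pd

# Hypothetical transfer log with one row per transfer; column names are
# illustrative, not the actual Globus schema.
log = pd.read_csv("transfers.csv")  # columns: user_id, src_endpoint, dst_endpoint

# Distinct user/endpoint pairs actually observed (as either source or destination).
pairs = pd.concat([
    log[["user_id", "src_endpoint"]].rename(columns={"src_endpoint": "endpoint"}),
    log[["user_id", "dst_endpoint"]].rename(columns={"dst_endpoint": "endpoint"}),
]).drop_duplicates()

n_users = log["user_id"].nunique()
n_endpoints = pairs["endpoint"].nunique()

# Fraction of all possible user/endpoint pairs that appear in the log.
sparsity = len(pairs) / (n_users * n_endpoints)
print(f"observed user/endpoint pairs: {sparsity:.4%} of possible pairs")
```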

In addition to highlighting sparsity, the heatmap shows two interesting patterns. First, an approximately diagonal line shows a correlation between the most active users and the most active endpoints: that is, active users strongly favor a single endpoint, so much so that this favored endpoint has a usage ranking similar to the user's. This striking pattern leads us to suspect that recommending the endpoint most frequently used by a user would provide good results. The second usage characteristic is the presence of vertical lines. While most users use one endpoint much more frequently than others, these lines indicate that some endpoints (primarily the most active endpoints) are used by many users. This knowledge can be leveraged by identifying and recommending these broadly used endpoints.

3. COLLABORATIVE FILTERING
Collaborative filtering (CF) is a technique commonly used by recommendation systems to determine rankings based on other users' rankings. In simple terms, the model assumes that if user A has a preference for the same item as user B, then A is more likely than a randomly chosen user to select another item preferred by B. CF techniques are commonly applied to product and movie recommendations: CF was used to win the Netflix Challenge [28], for example. CF has also been used with success to recommend web services to users [27].

CF is typically used to recommend new products to users based on explicit rankings, and it is thus more suitable for our second recommendation task of predicting a set of endpoints to be used in the future. We use the popular GraphLab [19] toolkit to implement a CF model. We set a user's rating of an endpoint to 1 if that user has used that endpoint, and 0 otherwise. We also give the model the parts of users' email suffixes (for example, [email protected] maps to the categories "gatech" and "edu") and the owners of each endpoint, the idea being that the CF model could find groups of collaborating users. We then applied the ranking factorization recommender [1], which writes all user-item recommendation weights as an equation of many unknown variables and then uses stochastic gradient descent to find the values for those variables that minimize some cost function. These variable values are then used to predict future endpoint ratings for users.
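The ranking factorization recommender itself is provided by GraphLab, but the underlying idea (latent factors fit by stochastic gradient descent over implicit 0/1 ratings) can be illustrated with a few lines of NumPy. This is a toy sketch, not the GraphLab implementation, and it omits the side features (email suffixes and endpoint owners) described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy implicit-rating matrix: R[u, e] = 1 if user u has used endpoint e, else 0.
# Shapes and values are illustrative only.
R = (rng.random((50, 80)) < 0.05).astype(float)
n_users, n_endpoints = R.shape
k = 8            # number of latent factors
lr, reg = 0.05, 0.01

# Latent factors for users (U) and endpoints (V); the predicted "rating"
# for pair (u, e) is the dot product U[u] . V[e].
U = 0.1 * rng.standard_normal((n_users, k))
V = 0.1 * rng.standard_normal((n_endpoints, k))

for epoch in range(20):
    for u in range(n_users):
        for e in range(n_endpoints):
            err = R[u, e] - U[u] @ V[e]
            # Stochastic gradient step on squared error with L2 regularization.
            U[u] += lr * (err * V[e] - reg * U[u])
            V[e] += lr * (err * U[u] - reg * V[e])

# Score unseen endpoints for user 0 and take the top-3 recommendations.
scores = U[0] @ V.T
scores[R[0] == 1] = -np.inf          # exclude endpoints already used
print(np.argsort(scores)[::-1][:3])
```

In the real ranking factorization model, the cost function also contains ranking terms that push unused items below used ones; here only squared reconstruction error is minimized.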

Figure 3: Long-tailed distributions of Globus usage: (a) transfer volume per user, (b) transfers per user, (c) distinct endpoints per user. In (a) and (b), all transfers are in orange and web (human) initiated transfers are in blue. (c) shows endpoint usage for all transfers; web transfers are similar.

Figure 4: Heatmap of user-endpoint transfer pairs for the 200 most active users and endpoints. Users are on the y axis and endpoints on the x axis. Colors are based on the log of the number of transactions for each user-endpoint pair.

4. ENDPOINT PREDICTION HEURISTICS
Globus stores detailed records regarding users, endpoints, and transfer history: user names, email addresses, and institutions; endpoint descriptions, locations, and settings; and transfer settings, performance, and errors. We use this information to develop a collection of specialized endpoint recommendation heuristics. When queried with a user ID, a date, and a positive integer n, each heuristic returns what it believes are the top-n best endpoint recommendations for that user ID on that date. We now describe each heuristic.

History: The history heuristic does exactly what one would expect: it predicts that the top-n best source (S) / destination (D) endpoints are the n most recently used S/D endpoints.

Markov chain: The Markov chain heuristic correlates previously used endpoints with potential future endpoints. To do so, it maintains a transition matrix of the probabilities of using each endpoint as a S/D, conditioned on a particular endpoint having previously been used as a S/D. These probabilities are estimated online by the Markov chain heuristic from the observed transitions. According to this heuristic, the top-n most likely S/D endpoints for a user are the top-n most likely endpoint transitions given that user's previous S/D endpoint choice.

Most unique users: The most unique users heuristic takes advantage of the long-tailed usage distribution: a small number of endpoints are used by many users, and most endpoints are used by few users. The top-n best S/D endpoints are the n endpoints that have been used as a S/D by the most unique users.

Institution: The institution heuristic maps users to their institution based on each user's associated email suffix. For example, "[email protected]" would be mapped to the institution "gatech.edu." The top-n best S/D endpoints for a user are the n endpoints owned by users at the same institution that have been used as a S/D by the most unique users.

Endpoint ownership: The endpoint ownership heuristic recommends the top-n most recently created endpoints owned by a user, based on the idea that if a user creates a new endpoint, that user is likely to use it soon.
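To make the heuristic interface concrete, the sketch below shows one possible implementation of the Markov chain heuristic with online estimation of the transition counts. The class and method names are hypothetical; only the behavior described above (counting observed S/D transitions and ranking candidates by estimated transition probability) is taken from the text.

```python
from collections import defaultdict

class MarkovChainHeuristic:
    """Sketch of the Markov chain heuristic for one role (source or destination):
    count observed endpoint-to-endpoint transitions and rank candidate next
    endpoints by estimated transition probability."""

    def __init__(self):
        # counts[prev][nxt] = number of times endpoint nxt followed endpoint prev
        self.counts = defaultdict(lambda: defaultdict(int))
        self.last_endpoint = {}  # user_id -> most recent endpoint in this role

    def observe(self, user_id, endpoint):
        prev = self.last_endpoint.get(user_id)
        if prev is not None:
            self.counts[prev][endpoint] += 1  # online update of the transition matrix
        self.last_endpoint[user_id] = endpoint

    def recommend(self, user_id, date, n):
        # 'date' is accepted to match the querying interface described above,
        # but this particular heuristic does not use it.
        prev = self.last_endpoint.get(user_id)
        if prev is None or not self.counts[prev]:
            return []
        # Rank candidates by observed transition frequency (i.e., estimated probability).
        ranked = sorted(self.counts[prev].items(), key=lambda kv: kv[1], reverse=True)
        return [endpoint for endpoint, _ in ranked[:n]]
```

Replaying a user's transfers through observe() and then calling recommend() yields that user's top-n candidate endpoints under this heuristic.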

5. COMBINING HEURISTICS
Our heuristics model different aspects of usage and therefore perform well for different classes of users. To provide the best possible recommendations, we combine these heuristics into a single, superior heuristic. To do so, we trained a deep recurrent neural network [14] on historical Globus data to select the best endpoint recommendations from each heuristic. We choose to use a recurrent neural network over more traditional, simpler ensemble methods because of the great success recurrent neural networks have achieved in learning series [16].

The model for this neural network is shown in Figure 5. The basic model is composed of two LSTM blocks stacked on top of two fully connected layers. The first LSTM block has an output size of 15, and the next LSTM block and the two fully connected layers have input and output sizes of 15. The first fully connected block uses a ReLU activation function, which has proven effective in deep neural networks [13]. The last layer of the network uses a softmax activation function, which creates a probability distribution of outputs. We do not claim that this network architecture is optimal, but we justify its complexity by comparing it to a simpler architecture of a single dense layer of size 15 and a softmax output layer (§6).
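The paper does not name a framework, but the described architecture maps naturally onto a few lines of Keras. The input dimension, sequence handling, and optimizer below are assumptions; only the layer sizes and activations follow the description above.

```python
from tensorflow import keras
from tensorflow.keras import layers

n_features = 64   # assumed size of the per-transaction input vector (not given in the text)

model = keras.Sequential([
    keras.Input(shape=(None, n_features)),     # variable-length user history
    layers.LSTM(15, return_sequences=True),    # first LSTM block, output size 15
    layers.LSTM(15),                           # second LSTM block
    layers.Dense(15, activation="relu"),       # first fully connected block (ReLU)
    layers.Dense(15, activation="softmax"),    # softmax over the 15 candidate slots
])
# The "categorical logarithmic objective" mentioned in Section 6.1 corresponds to
# categorical cross-entropy here; the choice of optimizer is an assumption.
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()
```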

As we see in Figure 3(c), there is a significant difference between endpoints used as sources and endpoints used as destinations. To address this difference, we have trained separate neural networks for source and destination endpoint recommendation. We explore this choice in §6.

5.1 Neural network input
The input to the neural network consists of two parts: recommendation weights and additional problem information. To obtain recommendation weights, each heuristic recommends its top-n endpoints, along with their weights. If an endpoint has already been recommended by another heuristic, then the heuristic continues recommending top endpoints until it has recommended n endpoints distinct from the other recommendations or until it can no longer make recommendations. Disallowing duplicate recommendations allows the network to consider more endpoints for its size, but it may reduce accuracy in some cases: if a particular endpoint is weighted highly by many heuristics, then that endpoint is likely a good recommendation. To alleviate this problem, we include each heuristic's weight for each recommended endpoint.

To give a brief example of how we incorporate the heuristics into our neural network, consider an endpoint X. Each heuristic weights endpoint X, that is, outputs how confident it is that endpoint X is the correct endpoint choice. If the user owns only Y, then the endpoint ownership heuristic will assign Y a weight of one. If endpoint Z is the second most recent endpoint used, then the history heuristic will assign Z a weight of 1/2. Each heuristic's weighting is fairly arbitrary, having only the property that more highly ranked endpoints have larger weights; in most cases, we use a weighting function of 1/rank. The neural network is then presented with the set of endpoints X, Y, Z along with their weights. We use such arbitrary weighting functions in the absence of more natural weighting functions and leave it to the neural network to learn how to interpret each heuristic's weightings.
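A minimal sketch of this input-assembly step is shown below, assuming the hypothetical per-heuristic recommend(user_id, date, n) interface from §4; the 1/rank weights and the skipping of duplicate endpoints follow the description above.

```python
def build_recommendation_inputs(heuristics, user_id, date, n):
    """Sketch of the input-assembly step: each heuristic contributes up to n
    endpoints not already recommended by an earlier heuristic, each weighted
    by 1/rank. The heuristic interface and function name are hypothetical."""
    chosen = []      # (endpoint, heuristic_name, weight) triples fed to the network
    seen = set()
    for name, heuristic in heuristics.items():
        added = 0
        # Ask for extra candidates so that duplicates can be skipped.
        for rank, endpoint in enumerate(heuristic.recommend(user_id, date, 2 * n), start=1):
            if endpoint in seen:
                continue                                 # already recommended elsewhere
            chosen.append((endpoint, name, 1.0 / rank))  # 1/rank weighting
            seen.add(endpoint)
            added += 1
            if added == n:
                break
    return chosen
```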

The performance of different heuristics could be affected by external information: for example, whether the user works in academia or industry, whether the endpoint is located in the same country, or whether the data accessed are large. In short, any number of factors could affect endpoint selection. To further improve recommendation accuracy, we provide the neural network with every piece of information available and let the neural network learn which information is important. The complete list of additional input information is as follows: transaction date; user institution type (academic, commercial, government, other); number of Globus users affiliated with the user's institution; total number and volume of the user's transactions; data size; transaction frequency and volume for each recommended endpoint; current accuracy of each heuristic for the user; and the correctness of each of the previous recommendations made for the user. Finally, each heuristic also provides its overall confidence in its recommendations for that user, again a fairly arbitrary weighting that is interpreted by the neural network. For example, the history heuristic uses the size of the user's transaction history to compute its confidence.

Figure 5: Neural network block. Takes as input heuristic recommendation weights and memory from past recommendations to the user, and outputs reweighted endpoint recommendations and updated recommendation memory.

5.2 Neural network output
Recall that we focus on two recommendation use cases: recommending the next endpoint used and recommending endpoints that might be used in the future. When recommending the next endpoint, we want the neural network to output a weight of 1 for the endpoint that will be used in the next transaction and a weight of 0 for all other endpoints. When recommending the set of future endpoints, the desired output is 1 if the corresponding endpoint will be used within the specified time interval after the transaction, and 0 otherwise.
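A sketch of the two target encodings, with illustrative function names and a candidate endpoint list as input, might look as follows.

```python
import numpy as np

def target_vector(candidates, next_endpoint):
    """One-hot target for the 'next endpoint' task: 1 for the candidate that
    is actually used in the next transaction, 0 elsewhere."""
    return np.array([1.0 if c == next_endpoint else 0.0 for c in candidates])

def future_target_vector(candidates, future_endpoints):
    """Multi-label target for the 'future endpoints' task: 1 for each candidate
    used within the chosen time window (week, month, or year), 0 otherwise."""
    used = set(future_endpoints)
    return np.array([1.0 if c in used else 0.0 for c in candidates])
```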

6. EVALUATION
We study our ability to predict 1) the next endpoint used; and 2) the endpoints to be used in the future. In the first case we compare the performance of collaborative filtering, our heuristics, and the ensemble neural network. In the second, we compare the performance of collaborative filtering and the ensemble neural network.

We compare performance with respect to accuracy as well as precision and recall. Accuracy measures our ability to correctly predict endpoints amongst a group of recommendations. We define two accuracy measures. The first, total accuracy, is the fraction of all endpoints correctly predicted, where t_p is the number of correct predictions and n_t is the total number of transfers:

\[ \text{Total Accuracy} = \frac{t_p}{n_t} \tag{1} \]

As shown in §2, most users perform few transfers. While predicting endpoints for the highly active users is important, we cannot ignore users with few transfers, as the latter would likely benefit most from good endpoint prediction. Therefore, to gauge how well our recommendations perform for each user, we also report user accuracy as an average accuracy across all users. Let t_{p,u} be the number of correct predictions for user u, t_u the number of transfers performed by user u, and n_u the total number of users. That is:

\[ \text{User Accuracy} = \frac{1}{n_u} \sum_{u=1}^{n_u} \frac{t_{p,u}}{t_u} \tag{2} \]

Precision and recall, two metrics commonly used in recommendation and information retrieval systems, measure how good recommended items are and how likely good items are to be recommended, respectively. They are defined as follows, where t_p, f_p, and f_n are the numbers of true positives, false positives, and false negatives, respectively:

\[ \text{Precision} = \frac{t_p}{t_p + f_p} \tag{3} \]

\[ \text{Recall} = \frac{t_p}{t_p + f_n} \tag{4} \]
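For reference, the four metrics translate directly into code; the per-user input lists below are an assumed format, not the paper's actual evaluation harness.

```python
def total_accuracy(n_correct, n_transfers):
    # Eq. (1): fraction of all transfers whose endpoint was predicted correctly.
    return n_correct / n_transfers

def user_accuracy(per_user_correct, per_user_transfers):
    # Eq. (2): mean over all users of each user's individual accuracy
    # (assumes every user in the lists has at least one transfer).
    n_users = len(per_user_transfers)
    return sum(c / t for c, t in zip(per_user_correct, per_user_transfers)) / n_users

def precision_recall(tp, fp, fn):
    # Eqs. (3) and (4).
    return tp / (tp + fp), tp / (tp + fn)
```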

All training and recommendation algorithms took only a few hours to run on hundreds of thousands of transactions, making them feasible for real-world use.

6.1 Training
We split historical Globus data into training and validation sets. We trained two groups of four networks (one for each combination of source/destination and top-1/top-3). The first group is used to predict the endpoints to be used in the next transaction; the second group is used to predict the set of endpoints to be used in the future. The networks for predicting endpoints to be used in the next transaction were trained on the first 500,000 transactions for 1000 epochs with batch sizes of 5,000 using the categorical logarithmic objective function, and validated using the most recent approximately 450,000 transactions. For the networks that predict the endpoints used in the future, we first defined "future" for three different time spans: one week, one month, and one year. That is, when we predict endpoints to be used in the future, we evaluate by predicting a set of endpoints that a user will use in the next week, month, or year. The future endpoint networks were trained for 1000 epochs with a batch size of 5,000 on all but the transactions in the past 1.5 years (about 380,000), with the categorical logarithmic objective function. For validation data, we used every transaction that occurred within the past 1.5 years, up until the specific time period evaluated. That is, given the need to evaluate within a time period (e.g., 1 year), we must have at least that period's data available for validation (e.g., for 1-year experiments we validated on historical data from 1.5 years ago to 1 year ago). All experiments contained at least 150,000 transactions.
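The time-based split described above could be sketched as follows, assuming a hypothetical CSV export of web-initiated transfers with a date column; the exact window handling used in the paper may differ.

```python
import pandas as pd

# Hypothetical transfer log sorted by time; file name and columns are illustrative.
log = pd.read_csv("web_transfers.csv", parse_dates=["date"]).sort_values("date")

# Next-endpoint task: first 500,000 transactions for training,
# the most recent ones for validation.
next_train = log.iloc[:500_000]
next_val = log.iloc[500_000:]

# Future-endpoint task: hold out the last 1.5 years for validation; each window
# (week / month / year) is evaluated only where that much future data exists.
cutoff = log["date"].max() - pd.DateOffset(months=18)
future_train = log[log["date"] < cutoff]
for window in [pd.DateOffset(weeks=1), pd.DateOffset(months=1), pd.DateOffset(years=1)]:
    latest_usable = log["date"].max() - window
    future_val = log[(log["date"] >= cutoff) & (log["date"] <= latest_usable)]
    print(window, len(future_val))
```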

6.2 Next endpoint recommendation
We next examine our heuristics' performance at predicting the endpoints to be used in the next transaction. Figure 6 shows each heuristic's total accuracy and Figure 7 shows user accuracy. Each figure shows accuracy when recommending the top-1 and top-3 endpoints. In addition to each heuristic and the neural network, we also report on recommendations produced by a simple neural network implementation (Shallow Neural Network) and by the optimal algorithm, Combined, that selects the correct endpoint if it is predicted by any single heuristic. The shallow neural network contains a single dense layer of size 15 and a softmax output layer.

Figure 6: Transfer recommendation accuracy

Figure 7: User recommendation accuracy

Figures 6 and 7 show that the deep neural network outperforms any single heuristic and indeed performs close to optimally, with a difference of 14.9%/0.8% for transfer recommendation accuracy and 13.7%/5.1% for user recommendation accuracy, relative to the optimal combined strategy. The history heuristic outperforms the other heuristics in all cases; this is due to the frequency with which users reuse the same endpoints. The most unique users, institution, and owned endpoints heuristics perform significantly worse than the others. Surprisingly, the Turi CF heuristic does much worse than far simpler heuristics, such as history, providing further evidence of the need for domain-specific knowledge. (In the case of the history heuristic, this knowledge may be that users often choose the same endpoint multiple times; in contrast, customers rarely buy the same book multiple times.) It should also be noted that predicting the same item and using implicit ratings is not a typical use case for CF. For all heuristics, user accuracy is lower than transaction accuracy, but this is not unexpected: there are many users with little transaction history on which to base predictions, and user accuracy gives greater weight to these users. Finally, we note that the deep recurrent neural networks have virtually the same accuracy as the shallow network on all datasets, except for top-1 transfer accuracy. This shows that our deep recurrent neural network is not an overly complex model, yet still adds value when compared to a simpler model.

Figure 8: Transfer precision and recall for different recommendation thresholds

Figure 9: User precision and recall for different recommendation thresholds

6.3 Future endpoint recommendation
We next study the effectiveness of our heuristics at predicting the endpoints a user will use in the future: more specifically, the endpoints that the user will use a week, month, and year after each transaction. We use the same heuristics and neural network combiner to predict these endpoints, but instead of recommending the top-1 or top-3 endpoints, we predict all endpoints that the neural network combiner weights above a certain threshold for the requested period of time. A lower threshold predicts more endpoints, increasing recall (the ratio of true predictions to endpoints actually used) but decreasing precision (the ratio of true predictions to all predictions). A higher threshold value, in contrast, recommends only heavily weighted endpoints and thus gives lower recall but higher precision. Precision and recall are both desirable in different situations; by adjusting our threshold value, we can observe obtainable precision versus obtainable recall for each heuristic. As when predicting current transaction endpoints, we consider both total precision and recall (Figure 8) and user precision and recall (Figure 9). We show only the precision and recall of the Turi CF heuristic and the neural network, as the other heuristics do not have obviously good endpoint weighting functions.
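A sketch of this thresholding step, with dummy combiner weights and ground truth in place of real model output, is shown below; sweeping the threshold traces out the precision/recall curves reported in Figures 8 and 9.

```python
import numpy as np

def precision_recall_at_threshold(scores, used, threshold):
    """Sketch of threshold-based future-endpoint prediction: 'scores' holds the
    combiner's weight per candidate endpoint, 'used' marks endpoints the user
    actually used within the window. Names and inputs are illustrative."""
    predicted = scores >= threshold
    tp = np.sum(predicted & used)
    fp = np.sum(predicted & ~used)
    fn = np.sum(~predicted & used)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

# Sweep the threshold to observe the precision/recall trade-off.
scores = np.random.rand(200)            # dummy combiner weights
used = np.random.rand(200) < 0.1        # dummy ground truth
for t in np.linspace(0.1, 0.9, 9):
    print(round(t, 1), precision_recall_at_threshold(scores, used, t))
```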

We see in Figures 8 and 9 that, by adjusting the threshold, the neural network maximizes precision and recall up to a point, whereas the CF approach performs significantly worse. Precision and recall for predictions over the next week and month are relatively similar; however, as expected, predicting endpoints a year into the future is less accurate. A similar trend is observed for Turi CF. Our neural network is able to achieve approximately 75% transfer precision and recall in all cases, and 90% for the week and month intervals, significantly better than Turi CF. That is, our model's predictions are correct at least 75% of the time, and if an endpoint will be used, our model will predict it at least 75% of the time. We again attribute Turi CF's significantly worse performance to its lack of domain-specific, content-based heuristics (e.g., owned endpoints, institution, and most unique users), which prevents it from producing a comprehensive list of the endpoints that a user could use, much less a list with few false positives. As in the previous section, our results are better when averaged over transfers rather than users because of the many users with few transactions.

Figure 10: Independent source and destination networks vs. combined network

6.4 Neural network analysis
We next explore what our neural networks have learned. While the neural networks perform well, it is not obvious what features are being used to predict usage. By exploring these features, we can gain insight into what factors are most indicative of future use.

First, we examine our choice to train two independent networks (for source and destination endpoints) rather than a single combined network for both (Figure 10). The combined network performs comparably to the separate networks for top-1 predictions (within 0.2% and 1.3%). However, the individual networks perform better for top-3 predictions (by 3.1% and 3.7%). This shows that there is a small benefit to using two networks.

Next we investigate which pieces of information are most related to prediction accuracy. To do so, we simply remove individual features by setting their inputs to zero. If an important feature is removed, then accuracy will drop significantly. Figure 11 shows the results for both the source and destination networks. Note the striking difference between the two: the destination network is relatively unaffected by the removal of features (except the history heuristic), whereas the source network is affected by the removal of any feature. This tells us that the source and destination recommenders behave in very different ways: the destination recommender relies very heavily on a user's recent history when making recommendations, while the source recommender, although still relying heavily on users' histories, makes significant use of other heuristics too. Because these networks are quite accurate, this tells us that whether a user has recently used an endpoint is a very important factor in determining whether that endpoint will be used as a destination, but that many factors, including the endpoint's popularity at the user's institution and overall, are important in determining whether the user will use that endpoint as a source.
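This ablation can be sketched as zeroing a group of input columns and re-measuring top-3 accuracy; the model and array layout below are assumptions carried over from the earlier Keras sketch, not the paper's actual harness.

```python
import numpy as np

def ablate_feature(model, X_val, y_val, feature_indices):
    """Zero out one feature group in the validation inputs and re-measure
    top-3 accuracy. Argument names and array layout are assumptions."""
    X_ablated = X_val.copy()
    X_ablated[..., feature_indices] = 0.0           # remove the feature by zeroing it
    probs = model.predict(X_ablated, verbose=0)
    top3 = np.argsort(probs, axis=-1)[:, -3:]       # indices of the 3 highest weights
    truth = np.argmax(y_val, axis=-1)
    return np.mean([t in row for t, row in zip(truth, top3)])
```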


Figure 11: Top-3 next transfer accuracy with individual features removed

7. RELATED WORK
While many researchers have studied characteristics of scientific networks and methods for recommending data and analysis tasks in workflow systems, we are not aware of other work on recommending data storage locations.

Predicting transfer throughput is an important tool for optimizing transfer parameters and for identifying and fixing software and hardware bottlenecks [23, 20, 18, 22]. In addition to creating throughput optimization schemes, recent studies of throughput prediction have examined the factors that affect transfer rate, including the effects of disk and RAID controller bandwidth, data compression, and parallelism [15]; the effects of transfer protocol characteristics, such as buffer and window size [25]; and the detrimental effects of concurrent transfers and the prioritization of transfers representing certain use cases [17]. Our work is differentiated both by its focus and by its approach, as none of these efforts use neural network methods for combining predictions. However, there is potential for these approaches to be mutually beneficial. For example, differentiating the throughput of transfers between the same pair of endpoints but initiated by different users requires some knowledge of those users, such as whether they work in a field with relatively incompressible data; this knowledge could also be used to predict which endpoints those users will use in the future. Furthermore, throughput prediction typically relies on various heuristics, whether averages, medians, or more complex statistical models. Not only does our neural network approach provide a powerful way of combining heuristics, it also offers a way to integrate problem information that is potentially relevant but difficult to use as a heuristic, something the previous ensemble methods used in throughput prediction do not. Finally, neural networks are able to learn complex and subtle relationships, and are reasonably amenable to analysis, allowing more insight into which factors influence various characteristics of scientific data transfers.

Another area of interest is the prediction of groups of files that are frequently transferred together. By identifying these groups, transfer performance can be improved using better caching and job scheduling algorithms [11]. While this problem is not directly related to endpoint prediction, we believe that many families of heuristics for characterizing transfers model both well. For example, the history heuristic is one of the best heuristics for predicting future endpoint usage, and historical data about which files were transferred together is often used to identify file groups. By using families of heuristics such as those described in this paper, specifically the Markov model and neural network, we believe that larger and more strongly associated groups of files could be identified.

Workflow systems such as Galaxy, Kepler, and Taverna are used to orchestrate complex scientific analyses composed of independent applications. These systems provide interfaces for many users to develop and execute workflows, and therefore provide a rich source of information regarding users, workflows, applications, and data [24, 10]. Following a similar motivation to our work, researchers have developed recommendation approaches to simplify the usage of scientific workflow systems by recommending workflows, applications, and data to users [26, 5, 3, 9]. While these approaches are in a different domain, we have applied similar techniques to recommending endpoints. Furthermore, our work is complementary to these efforts, as endpoint recommendations could be used to support the creation of workflows and the selection of input data.

8. SUMMARY
We have developed and evaluated a collection of heuristics for recommending data locations to users. By leveraging rich sources of historical usage information and information about users, endpoints, and transfer settings, we were able to create specialized heuristics that consider transfer history, user institutions, endpoint ownership, and other information. While these heuristics performed well individually, the greatest performance is obtained by combining the heuristics into an ensemble model using a recurrent neural network. Our neural network model was able to correctly predict the endpoints to be used in the next transfer with 80.3% and 95.3% accuracy for top-1 and top-3 recommendations, respectively. It was also able to predict the endpoints that a user will use in the future with greater than 75% precision and recall. These results not only provide value for users but may also provide insights into how scientific big data is used, and could be applied to develop better algorithms for accessing and transferring data.

We aim next to develop and deploy an online recommendation engine within Globus. We are also interested in characterizing, modeling, and improving the use of scientific big data by analyzing the performance of heuristics for different user profiles. Finally, we note that our deep recurrent neural network has virtually the same accuracy as a simpler neural network. While the simpler neural network still integrates novel features and heuristics in a new way, we find the difference between the two models disappointing, and hope to improve our usage of deep recurrent neural networks in future work.

Acknowledgements
This work was supported in part by National Science Foundation grant NSF-1461260 (BigDataX REU) and Department of Energy contract DE-AC02-06CH11357.

9. REFERENCES
[1] Ranking factorization recommender. https://turi.com/products/create/docs/generated/graphlab.recommender.ranking_factorization_recommender.RankingFactorizationRecommender.html. Accessed Sept. 2016.
[2] B. Allen, J. Bresnahan, L. Childers, I. Foster, G. Kandaswamy, R. Kettimuthu, J. Kordas, M. Link, S. Martin, K. Pickett, and S. Tuecke. Software as a service for data scientists. Communications of the ACM, 55(2):81–88, Feb. 2012.
[3] I. Altintas, C. Berkley, E. Jaeger, M. Jones, B. Ludascher, and S. Mock. Kepler: An extensible system for design and execution of scientific workflows. In 16th International Conference on Scientific and Statistical Database Management, pages 423–424. IEEE, 2004.
[4] A. Ansari, S. Essegaier, and R. Kohli. Internet recommendation systems. Journal of Marketing Research, 37(3):363–375, 2000.
[5] J. Cao, S. A. Jarvis, S. Saini, and G. R. Nudd. GridFlow: Workflow management for grid computing. In 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid, pages 198–205. IEEE, 2003.
[6] K. Chard, M. Lidman, B. McCollam, J. Bryan, R. Ananthakrishnan, S. Tuecke, and I. Foster. Globus Nexus: A platform-as-a-service provider of research identity, profile, and group management. Future Generation Computer Systems, 56:571–583, 2016.
[7] K. Chard, J. Pruyne, B. Blaiszik, R. Ananthakrishnan, S. Tuecke, and I. Foster. Globus data publication as a service: Lowering barriers to reproducible science. In 11th IEEE International Conference on e-Science, pages 401–410, Aug. 2015.
[8] K. Chard, S. Tuecke, and I. Foster. Efficient and secure transfer, synchronization, and sharing of big data. IEEE Cloud Computing, 1(3):46–55, Sept. 2014.
[9] D. Churches, G. Gombas, A. Harrison, J. Maassen, C. Robinson, M. Shields, I. Taylor, and I. Wang. Programming scientific and distributed workflow with Triana services. Concurrency and Computation: Practice and Experience, 18(10):1021–1037, 2006.
[10] E. Deelman, D. Gannon, M. Shields, and I. Taylor. Workflows and e-Science: An overview of workflow system features and capabilities. Future Generation Computer Systems, 25(5):528–540, 2009.
[11] S. Doraimani and A. Iamnitchi. File grouping for scientific data management: Lessons from experimenting with real traces. In 17th International Symposium on High Performance Distributed Computing, pages 153–164. ACM, 2008.
[12] I. Foster. Globus Online: Accelerating and democratizing science through cloud-based services. IEEE Internet Computing, 15(3):70, 2011.
[13] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In 14th International Conference on Artificial Intelligence and Statistics, page 275, 2011.
[14] A. Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
[15] E.-S. Jung, R. Kettimuthu, and V. Vishwanath. Toward optimizing disk-to-disk transfer on 100G networks. In IEEE International Conference on Advanced Networks and Telecommunications Systems, pages 1–6. IEEE, 2013.
[16] A. Karpathy. The unreasonable effectiveness of recurrent neural networks. http://karpathy.github.io/2015/05/21/rnn-effectiveness/, 2015. Accessed Sept. 2016.
[17] R. Kettimuthu, G. Vardoyan, G. Agrawal, P. Sadayappan, and I. Foster. An elegant sufficiency: Load-aware differentiated scheduling of data transfers. In International Conference for High Performance Computing, Networking, Storage and Analysis, page 46. ACM, 2015.
[18] C. Lee, H. Abe, T. Hirotsu, and K. Umemura. Predicting network throughput for grid applications on network virtualization areas. In 1st International Workshop on Network-aware Data Management, pages 11–20, 2011.
[19] Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein. Distributed GraphLab: A framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endowment, 5(8):716–727, 2012.
[20] M. Swany and R. Wolski. Multivariate resource performance forecasting in the Network Weather Service. In ACM/IEEE Conference on Supercomputing, 2002.
[21] S. Tuecke, R. Ananthakrishnan, K. Chard, M. Lidman, B. McCollam, and I. Foster. Globus Auth: A research identity and access management platform. In 12th IEEE International Conference on e-Science, 2016.
[22] S. Vazhkudai and J. M. Schopf. Predicting sporadic grid data transfers. In 11th IEEE International Conference on High Performance Distributed Computing, pages 188–196. IEEE, 2002.
[23] R. Wolski. Experiences with predicting resource performance on-line in computational grid settings. SIGMETRICS Performance Evaluation Review, 30(4):41–49, 2003.
[24] J. Yu and R. Buyya. A taxonomy of workflow management systems for grid computing. Journal of Grid Computing, 3(3-4):171–200, 2005.
[25] D. Yun, C. Q. Wu, N. S. Rao, B. W. Settlemyer, J. Lothian, R. Kettimuthu, and V. Vishwanath. Profiling transport performance for big data transfer over dedicated channels. In International Conference on Computing, Networking and Communications, pages 858–862. IEEE, 2015.
[26] J. Zhang, W. Tan, J. Alexander, I. Foster, and R. Madduri. Recommend-as-you-go: A novel approach supporting services-oriented scientific workflow reuse. In IEEE International Conference on Services Computing, pages 48–55, July 2011.
[27] Z. Zheng, H. Ma, M. R. Lyu, and I. King. WSRec: A collaborative filtering based web service recommender system. In IEEE International Conference on Web Services (ICWS), pages 437–444, July 2009.
[28] Y. Zhou, D. Wilkinson, R. Schreiber, and R. Pan. Large-scale parallel collaborative filtering for the Netflix prize. In 4th International Conference on Algorithmic Aspects in Information and Management, pages 337–348, 2008.
