
Recommendation of APIs for Mashup creation

Priyanka Samanta ([email protected])
Advisor: Dr. Xumin Liu ([email protected])

B. Thomas Golisano College of Computing and Information Sciences
Rochester Institute of Technology, Rochester, New York, USA

ABSTRACT

The aim of this project is to improve an existing recommendation system that recommends relevant APIs for creating a mashup. The system recommends a list of relevant APIs based on each API's functionality, usage history, and popularity. It takes a user's input in the form of a mashup description and then searches the existing database of mashups and APIs to produce the list. To improve the existing system, advanced topic modelling and collaborative filtering are used to leverage API functionality and API usage history, respectively.

1. INTRODUCTION

1.1 Problem Introduction

With the advent of Web 2.0, the creation of web mashups has gained importance and popularity. Today the web is more a repository of services than of contents. As the term suggests, a mashup is a composition created by blending multiple features into one. A web mashup is a web application that blends one or more web services to create a new service with enriched functionality and visualization. For example, Mapskreig is a web mashup that combines services from Craigslist and Google Maps to help end users look for housing through a map interface.

Mashup creation from existing Web APIs is a boost to social software, as mashups are created rapidly, involve software reusability, and require minimal technical skill. Today the repository of Web APIs has crossed 18,000. With this large dataset and the unstructured descriptions of the available APIs, it is very challenging to discover relevant APIs in this giant, growing online repository. To address this problem, many solutions have been proposed to recommend relevant APIs for mashup creation, based primarily on API functionality, the usage history of APIs in existing mashups, and API popularity. This paper aims to exploit advanced machine learning algorithms to improve the performance of the existing recommendation system, which incorporates all these features in recommending a list of relevant APIs.

1. There has been unstoppable growth in the field of machine learning. To exploit the functionality of APIs in the recommendation system, the Hierarchical Dirichlet Process (HDP) is used. This is an advanced probabilistic topic modelling algorithm that addresses some of the shortcomings of Latent Dirichlet Allocation (LDA). HDP is used to represent the existing APIs and mashups as distributions over latent topics; a similarity measure then computes how similar the user's mashup specification is to the existing mashups.

2. To leverage the usage history of APIs in existing mashups, an advanced version of matrix factorization is used. This forms an integral part of collaborative filtering and helps provide a relevant list of APIs for a user's mashup input based on usage history. The system can also handle the cold start problem efficiently. Probabilistic matrix factorization overcomes some of the drawbacks of the traditional matrix factorization used in collaborative filtering.

3. Finally, Bayes' theorem is applied to combine the results obtained from (1) and (2), and the final list is recommended based on the popularity of the APIs. Popularity acts as a Quality of Service (QoS) measure for the APIs.

1.2 Background and Definitions

• API - A Web API (Application Programming Interface) is a web service that works on HTTP requests and responses. It is a method invoked via HTTP requests (by programs) to get results back: a software system that supports communication between machines across the network.

• Mashup - A mashup is a web application that blends one or more web services together, giving enriched functionality in a new GUI. Mashups are easy to build and require minimal technical exposure.

This paper is organized as follows: Section 2 reviews the relevant work done in this field. Section 3 outlines the approach taken to implement this project. Section 4 details the implementation of the system, followed by Sections 5 and 6, which provide the results and the evaluation of the system, respectively. Finally, this work is concluded in Section 7.


Figure 1: Components of Web Service

2. RELATED WORK

In this section, a literature review is presented based on the study conducted in the field of this project. Several papers have been reviewed that detail the work done on the existing system, along with a few that discuss the advanced machine learning algorithms that help improve the existing system and build this project.

2.1 Functionality-based recommendation using tagging

The authors of [5] came up with a novel approach of recommending mashups by deriving users' interests from their service usage history and developing a social network that exploits the relationships among mashups, web APIs, and tags. According to the authors, existing recommendation systems based on semantics, QoS (Quality of Service), and social networks do not take into consideration the users' interests in the services. The paper contributes a new system incorporating both the users' interests and the social relationships among the services. The dataset used in the experimentation was ProgrammableWeb: "The data consisted of 6076 mashups, 4492 Web APIs and 1736 tags" [5]. The experiment proved to provide more efficient and effective results in service discovery than the existing systems. The authors came up with two models, a User Interest Model and a Social Network Model, to leverage user interest and social relationships, respectively.

User Interest Model: TF/IDF (Term Frequency/Inverse Document Frequency) was employed to build the User Interest Model, determining the weight of each term in the corpus. The corpus here is the group of description documents of the available mashups; these documents consist of details like the name of the mashup, its functionality, the web APIs used, the names of the developers, and the tags. At first, the term frequency is calculated as the frequency of the ith term in the jth mashup description document. It is then normalized, as the lengths of the description documents may vary. TF was calculated using the formula:

TF = (frequency of the ith term in the jth document) / (total number of terms in the jth document)

Secondly, IDF measures how common the ith term is across all the mashup description documents. It is calculated using the formula:

IDF = log( (total number of mashup description documents in the corpus) / (number of documents in which the ith term appears) )

TF/IDF is calculated as the product of these two terms. This formula was deployed on the user's mashup usage history: the description documents of all the mashups used by a user were combined to create a corpus, and TF/IDF was used to convert it into a vector of term weights. Similarly, a vector was formulated from a given mashup's description document. Following this, cosine similarity was applied to compute the similarity between the user's interest (based on usage history) and a given mashup service description; this similarity constitutes eq. (1) below.
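A minimal Python sketch of these two steps, assuming tokenized documents (the toy corpus and function names are illustrative, not from [5]):

    import math
    from collections import Counter

    def tf_idf_vectors(documents):
        """Compute a TF/IDF weight vector for each tokenized document."""
        n_docs = len(documents)
        # Document frequency: number of documents containing each term.
        df = Counter(term for doc in documents for term in set(doc))
        vectors = []
        for doc in documents:
            counts = Counter(doc)
            total = len(doc)
            # TF is normalized by document length; IDF by corpus-wide rarity.
            vectors.append({t: (c / total) * math.log(n_docs / df[t])
                            for t, c in counts.items()})
        return vectors

    def cosine_similarity(u, v):
        """Cosine similarity between two sparse term-weight vectors."""
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    # Toy usage: user history corpus vs. a candidate mashup description.
    history = [["map", "housing", "search"], ["photo", "map", "share"]]
    candidate = ["map", "housing", "listing"]
    vecs = tf_idf_vectors(history + [candidate])
    print(cosine_similarity(vecs[0], vecs[-1]))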

Social Network Model: This model was built in three stages using the relationships existing between mashups and web APIs, and between mashups and tags. Mashups are created using one or more Web APIs. A tag is a keyword assigned to a piece of information, here to a mashup or a web API, that describes its features and functionality. The relation between API and mashup can be termed an "invocation relationship" [5], and the one existing between tags and APIs or mashups can be coined a "social marking relationship" [5]. Similarity between two mashups is determined using the invocation relationship and the social marking relationship; that is, two mashups are said to be similar to each other if they invoke similar APIs or are marked with similar tags. This is computed using the Jaccard coefficient. The social network is constructed as an undirected graph containing the mashups as its nodes, connected by edges only if they possess similar tags and invoke similar APIs. The edges are marked with weights that define the degree of similarity between the mashups (eq. 2). Based on the User Interest Model and Social Network Model, a mashup service is recommended by determining its similarity with similar mashups using the formula:

a × eq(1) + b × eq(2), where a + b = 1 and 0 <= a, b <= 1
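A sketch of the social-network side, assuming each mashup is represented simply by its API set and tag set (the even blend inside eq. (2) is an assumption; the exact combination in [5] may differ):

    def jaccard(set_a, set_b):
        """Jaccard coefficient: |intersection| / |union|."""
        if not (set_a or set_b):
            return 0.0
        return len(set_a & set_b) / len(set_a | set_b)

    def social_similarity(m1, m2):
        """eq(2): average of invocation (shared APIs) and social-marking
        (shared tags) similarity between two mashups."""
        return (jaccard(m1["apis"], m2["apis"])
                + jaccard(m1["tags"], m2["tags"])) / 2

    def recommendation_score(interest_sim, m1, m2, a=0.6, b=0.4):
        """a*eq(1) + b*eq(2), with a + b = 1 as in the formula above."""
        return a * interest_sim + b * social_similarity(m1, m2)

    m1 = {"apis": {"GoogleMaps", "Craigslist"}, "tags": {"housing", "map"}}
    m2 = {"apis": {"GoogleMaps", "Flickr"}, "tags": {"photo", "map"}}
    print(recommendation_score(0.8, m1, m2))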

After this model was developed, it was tested on the ProgrammableWeb dataset mentioned above, and it was observed that the proposed model outperformed the normal process of service discovery based on users' relevance search or social-network-based search. The accuracy of the traditional method for finding mashups was measured using DCG (Discounted Cumulative Gain), an algorithm to evaluate information retrieval over the web. To conclude, this paper proposes a novel approach to recommend mashups to users based on the users' mashup usage history and the social relationships existing between the mashups, web APIs, and tags.

The recommendation system can be improved further by exploiting the semantic meanings of tags. [9] describes the approach of deriving semantic meaning from tags to help build "data-driven mashups" [9]. Tags are powerful hierarchical keywords that, when attached to a piece of information, effectively describe that information. The authors of this paper foresaw some of the existing problems in building a mashup: (i) developers need fast discovery of services, as they build mashups to address situational requirements with minimal programming skills, and (ii) an efficient way to look for services is needed, as the repository of services has been growing exponentially. To address these problems, the authors proposed building data-driven mashups based on the semantic annotation of tags. The concept of tags gained its importance from its use in web resources. With the help of tags, it becomes easy to discover a service, get insight into it, and build a mashup. A few features the paper emphasizes are: (i) extracting the semantics from the tags and associating them with the services' input and output, and (ii) data-centric development of the mashup, requiring no details of the underlying programs.

The solution the authors describe is built with three major components: (i) a service repository, (ii) a service advisor, and (iii) a user interface. The service repository hosts all the web APIs and also allows publication of new ones. APIs contain tag information; using several machine learning techniques, this tag information is extracted and semantically similar tags are grouped into clusters. The role of the service advisor is to facilitate the composition of services toward the desired goal based on the links between tags. Finally, the user interface is used to access the service. As part of the solution, the authors came up with the "tag based service semantic model" [9]. This model was built on the observation that end users and developers build mashups by looking at the data consumed (input data) and the data generated (output data) by the web APIs. Thus sets of tags are associated with the input and output data, making discovery of the services easy for developers. But the use of tags is not always accurate: its limitation lies in the fact that similar tags may be used by humans to describe different information, or similar information may be tagged with different keywords. To overcome this, an unsupervised learning technique was exercised to extract semantically similar meanings from tags and put them in clusters. For example, tags like apple, orange, and banana can all be put in the cluster "fruit". This gave rise to the concept of a "feature tag" [9], which can represent all other similar tags in its cluster. The authors also introduced a tag taxonomy: if a tag 'a' expresses a semantically higher notion than tag 'b', then tag 'b' is named a sub-tag of tag 'a'. Using this tag based service semantic model, the authors proposed a design for data-driven mashup composition using web APIs. Two web services, web service-1 and web service-2, are eligible to be composed together if the output data tag 'x' of web service-2 is consumed as input data by web service-1, or if web service-1 consumes tag 'y' where tag 'x' is a sub-tag of tag 'y'. These two web services are then said to be mapped semantically by tags. Once the input and output data have been associated with tags, this model facilitates end users developing their mashups with ease: if they explain their desired goal in the form of tags, the discovery of web services bearing those semantic tags becomes easy and effective. The experimentation with this model was conducted on "200 web services, 3100 mashups and 23,971 tags" [9]. As a result, it was observed that around 80% of the developers achieved their service-creation goals within the top-3 list of recommendations. It may be concluded that the idea of using semantic tags in developing data-driven mashups proved efficient in providing the output (in terms of web services) needed by developers to build their product.

[13] provides a similar approach to looking for similarity between documents, but it exploits semantic hashing of tags. The paper proposes a new model of semantic hashing extracted from tags and topic modelling. This model is a solution to the limitations of the existing hashing methods, namely: (i) tags are widely used in real-world applications to be associated with a piece of information, but they have not been exploited completely, and (ii) similarity based on keyword matching fails to reflect the semantic similarity existing among keywords. The model proposed by the authors incorporates both the usage of tags and probabilistic topic modelling to determine keyword similarity.

With the growth of the internet and the increase in web resources in the form of images, texts, data, audio, and video, it has become significantly important to look for efficient techniques to determine search similarity. When working with large datasets, we usually face two challenges: (i) storing the large-scale data, and (ii) retrieving data from this large dataset; both make applying traditional text similarity difficult. To address these limitations, semantic hashing has proved effective. It works by transforming each document into binary codes, so that similar documents are mapped to similar binary codes. A query document can likewise be transformed into binary codes, and the Hamming distance can then be computed to determine the similarity between documents. This process is effective and involves low cost, as the large dataset can be stored easily in memory and the retrieval process is fast. Though semantic hashing is capable of handling this problem, it has the limitations mentioned above. Thus the authors proposed a new approach, Semantic Hashing using Tags and Topic Modelling (SHTTM). As part of the solution, a framework is built to ensure that the hashing codes of tags are consistent and that the semantic meaning of the documents is preserved by using topic modelling. Experiments performed on real-world datasets showed this solution to be far better than the existing methods. The paper sheds some light on various hashing techniques too. These are:

1. Semantic Hashing (SH): It uses a fixed number of bits to represent documents in binary code; thus it takes very little time to compute the similarity between documents.

2. Locality Sensitive Hashing (LSH): One of the most frequently used hashing techniques; it maps high-dimensional data in Euclidean space to a low dimension using random linear projections.

3. Self-Taught Hashing (STH): This hashing technique is more efficient than SH and LSH, but it fails to address two major limitations: (i) it does not exploit the use of tags, and (ii) it does not preserve the semantic similarity between documents. This mechanism uses a combination of supervised and unsupervised learning techniques. During unsupervised learning, it builds a similarity graph using a fixed number of nearest neighbors; it then maps the documents to a k-dimensional space and converts them to binary codes.

4. Semi-Supervised Hashing (SSH): This hashing technique is used to find the similarity between a pair of documents, and it does make use of tag information. If two documents share a common tag, they are termed Must-Link, otherwise Cannot-Link. Using this concept, the method generates binary hashing codes that are closer to each other for Must-Link document pairs and farther apart for Cannot-Link pairs. Unfortunately, this method fails to work when the tag information is missing.

Finally the authors propose the SHTTM approach, which consists of two components: (i) a "tag consistency component" [13] and (ii) a "similarity preservation component" [13]. The tag consistency component gives insight into the tags associated with the information. It is essential to determine the correlation existing between the hashing codes and the tags (using a matrix factorization model) and also to handle tags that are missing (using a confidence matrix), which is not rare in real-world scenarios. The similarity preservation component maintains the semantic similarity between documents: if documents are semantically similar, they must be mapped to similar codes, making the Hamming distance between them minimal. This is handled using topic modelling, which measures the similarity across documents instead of the "features from original keywords" [13]. Using topic modelling, the documents can be reduced to a few topics, making the similarity search easy. Experiments conducted with this model on various real-world data suggest that this approach generates more accurate hashing codes than existing methods; they also showcase the benefits of incorporating tag-based hashing and preserving semantic similarity between documents.
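A toy illustration of the retrieval step common to all of these hashing schemes (the binary codes below are made up; producing good codes is exactly what SHTTM addresses):

    def hamming_distance(code_a, code_b):
        """Number of differing bits between two equal-length binary codes."""
        return sum(bit_a != bit_b for bit_a, bit_b in zip(code_a, code_b))

    # Hypothetical 8-bit codes produced by some hashing scheme.
    index = {"doc1": "10110010", "doc2": "10110011", "doc3": "01001100"}
    query_code = "10110110"

    # Rank documents by Hamming distance to the query code (nearest first).
    ranked = sorted(index, key=lambda d: hamming_distance(index[d], query_code))
    print(ranked)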

2.2 Recommendation based on social network and QoS using semantic similarity

[11] suggests that mashups are an important supporting theory for end-user service composition. To support end users through the capture and analysis of social interactions, the authors of this paper provide a new concept: they concentrate on extracting useful information from social networks to feed the construction of a service recommendation. The approach is based on social networks that develop interactions around particular shared interests. The paper provides recommendations obtained from social networks through implicit social-interaction analysis; the social graph is inferred from the common composition interests of users.

The paper mainly focuses on defining social structures in the services area and how these structures could be leveraged to simplify the problems involved in SOA (Service Oriented Architecture). To address this, the authors propose a framework known as Social Composer (SoCo), mainly for the composition and discovery of services. The framework transfers user-service interaction into user-user interaction, where statistical processes are applied to provide recommendations helping a user construct a mashup. As mentioned above, the implicit social relation is obtained from the activities of the users.

Experiments were conducted to evaluate the performance and behavior of the proposed framework. These experiments showed that mashup size is the main parameter to consider and that it has a linear relation with the response time for making a recommendation for a service. Two random datasets were generated with uniform statistical distribution, and it was found that the response time matters less in the long-tail case than with the uniformly distributed dataset. The results obtained through this research proved useful in interactive applications. But a major disadvantage is the fact that the algorithm's response time depends on the mashup size, which happens to be distributed between two and five services. Thus a new approach for service recommendation for mashups based on social network analysis was provided. Though the use of the SoCo framework showed promising changes, certain challenging issues, such as algorithm behavior under deficient learning information and auto-completion of mashups, are yet to be addressed.

[14] states that existing recommendation systems work by finding semantic similarity between the requirement description given by the user and the functional details of the APIs. A few models do work on the basis of QoS (Quality of Service), the social relationships found between APIs and mashups, or a user interest model. But the authors pointed out two limitations of present recommendation systems: the services are recommended and ranked to users for mashup creation, but they are not categorized, so it is difficult for users to understand the ranking of a specific API within a particular category of service; and since current recommendation systems provide no insight into the categorization of services, it is cumbersome for users to find the best service suited to their development task. To address these limitations, the authors built a three-step model.

1. kMeans-variant clustering algorithm: Unlike the traditional clustering algorithm, this variant first looks for the relevant service in every category and then clusters the remaining services based on their functionality. This lays the basis for category-based ranking (a generic clustering sketch follows this list).

2. Service category relevance ranking (SCRR) model: This model explicitly identifies the categories from the given mashup description. This is achieved by Category Topic Matching (CTM) and Category Affinity Propagation (CAP). CTM plays an important role in looking for the best categories given the functional requirements of the mashup, using a probabilistic model to find the functional relevance of the mashup description to every existing category. With the output obtained from CTM, CAP finds more relevant services through collaborative filtering based on the usage history of the services.

3. Category-aware distributed service recommendation model: The distributed framework of this model finds the ranks of services within categories. Thus this model solves the problem of producing a list of ranked services that respects category separation.
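As a generic illustration of the clustering idea only (this is plain k-means over TF/IDF vectors, not the authors' variant algorithm):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    descriptions = [
        "map based housing search service",
        "photo sharing and tagging api",
        "geocoding and map tiles api",
    ]

    # Vectorize service descriptions, then cluster into functional categories.
    vectors = TfidfVectorizer(stop_words="english").fit_transform(descriptions)
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
    print(kmeans.labels_)  # cluster id assigned to each service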

The working model was developed in two parts: a clustering model that works offline and a recommendation system that works online. In the offline part, the relevant categories are determined using clustering: the existing mashups and services are clustered into categories based on usage history. At the core of this process, LDA is performed between the mashup feature matrix and the service feature matrix, and an initial base for the recommendation system is formulated. In the online part, when the user's mashup-creation requirement is received, it is converted into a feature-matrix description and compared against the existing services' feature matrix, after which the SCRR model is used to determine the relevant ranked APIs in different categories. Finally, a category-aware distributed recommendation system is built using the service rankings obtained previously. After this model was built, the experiment was carried out on the ProgrammableWeb dataset, consisting of 6,813 mashups and 7,186 services. For the performance metrics, Normalized Cumulative Gain (NCG) and Mean Average Precision (MAP) were used. The model outperformed the existing recommendation system and proved efficient, showing a 30% improvement in accuracy.

2.3 Recommendation tools and algorithms

Papers reviewed in this section help build a better understanding of the advanced machine learning technologies in the fields of LDA and collaborative filtering. [7] provides some insight into collaborative filtering and how it should address the dynamic, exponential growth of data. Collaborative filtering is one of the techniques used in building recommendation systems: it builds a model based on users' usage history and usage patterns, and it reduces the error rate that may exist while trying to predict the hidden structure in users' ratings. An ideal recommendation system should be capable of serving users' choices even when the data grows dynamically, but this demands constant tuning of parameters, as predicting users' ratings over a growing dataset is difficult: with the growth in data comes constant change in the number of users and the number of available ratings. To address this problem, the authors suggest that collaborative filtering be a time-dependent method built on the kNN algorithm. This method outperforms the existing algorithms when "user-per neighborhood" [10] sizes are added and updated. The method was tested on several large datasets and proved to outperform existing methods by using sequential classification over the growing dataset with kNN.

After learning about collaborative filtering, [10] helps in exploring the matrix factorization method for recommendation systems, which exploits the structural semantics of information. Matrix factorization is an effective method because it allows finding the latent features underlying the interactions between users and items; at its core, it is just a mathematical tool. As the problem statement, the paper notes that very few matrix factorization models take into consideration the structure of the content. For example, users may express their preference over different kinds of cuisine, like Indian, Thai, Chinese, Continental, and many more, but each of these cuisines can be subdivided into several categories, making it essential to know the hidden structural content to predict a user's interest model. The authors proposed a model to improve the efficiency of matrix factorization with the help of graph regularization on latent users. There are two known approaches for extracting the hidden structures: (i) using graph partitioning theories to formulate the permutation view, and (ii) using regularization techniques to perform feature selection. The modern approach is to incorporate more information, in terms of item structure and social network, along with the available resources. One of the structural constraints proposed by the authors is the Laplacian constraint, which addresses the community structure of the latent users in matrix factorization. With this advanced method of collaborative filtering, the authors suggest, structural constraints can easily be exploited.

In addition to collaborative filtering, some research was reviewed on hierarchical topic modelling. [2] and [4] help in understanding hierarchical topic modelling through the real-world example of nested Chinese restaurants. This algorithm works by assigning every word of a document to one of k topics; it then treats each cluster of k topics with its group of words as a synthetic document and recursively assigns words to sub-topics within that synthetic document. This builds a top-down hierarchical topic model. This idea is utilized in a later milestone to improve on the existing LDA, which is used to find the similarity between mashup descriptions and the user's input to the system.

In a nutshell, [6] expresses the overall flow of this project: what it intends to do, but with improved algorithms that increase the accuracy of the existing recommendation system. The paper notes that existing approaches for recommending Web APIs are based on API functionality, QoS (Quality of Service), or social networks, each focusing mainly on one aspect. (i) The functionality-based approach provides a list of relevant APIs by using API descriptions, semantic markups, tags, API categories, and topic models [5], but the obtained list of relevant APIs may have low QoS value. (ii) The QoS-based approach leverages QoS values to find relevant APIs, but it needs user input on the functionally related APIs in order to evaluate their corresponding QoS, which is troublesome given the large dataset. (iii) Finally, the social-network-based approach makes use of the networking among mashups, APIs, and users; this too demands user information, which may often be unavailable. Overcoming the limitations of the above approaches, the paper comes up with a novel idea: recommend APIs for mashup creation using the mashup's textual description as the input. The approach integrates API functionality, API usage history, and API popularity to estimate the recommendation of relevant APIs. This well-written paper demonstrates the advanced work conducted by its authors relative to the previous work done in this domain. The steps taken to solve the addressed problem can be briefly enumerated as below:

1. API discovery based on functionality: LDA (Latent Dirichlet Allocation) is used to identify functionally relevant APIs given the textual description of the mashup. The LDA model is trained on the existing APIs' textual descriptions, and the trained model is used to find the topic distribution of the provided mashup description. Cosine similarity is then computed between these two topic distributions, and functionally relevant APIs are determined.


2. Adding additional APIs to the search space: To overcome the limitation of topic modelling in the absence of rich textual descriptions of APIs, a collaborative filtering technique is applied to identify the historical usage of APIs in existing mashups.

3. Application of Bayes' theorem: Bayes' theorem is applied to the output of Step 1 and Step 2 to compute each API's posterior probability of being used in the new mashup. This step generates a list of top relevant APIs.

4. Popularity-based API ranking: In this final step, the relevant APIs obtained from Step 3 are ranked based on QoS, a non-functional property. It is assumed that the APIs with higher QoS are the ones most frequently used in the making of existing mashups. The output generated from this step is the solution to the addressed problem.

To conclude, the paper presents experimental results for the existing approach and the proposed approach. The proposed model performed better, with an increase in recall and precision over the existing model, but it leaves room to improve the performance further.

3. DESIGN

This section details the design of the proposed system. The system has been designed in three phases that exploit the following features to build up the hybrid system:

1. Functionality of API

2. Usage history of API

3. Popularity of API

3.1 Phase 1: Topic Modelling: Exploiting API functionality

This phase concentrates on finding the relevant APIs for the user's input based on functionality. We use the HDP algorithm in this phase.

Before describing the working model of hierarchical topic modelling, here is what topic modelling refers to. Today the amount of information available online is growing exponentially, and when looking for relevant information in this huge online resource, the operations performed are search and links: search engines return relevant information from the online repository and support interaction with the resources. It is manually impossible to mark or group documents by theme and topic in a way that would support efficient and relevant retrieval; instead, probabilistic topic modelling is used. Topic modelling is a statistical algorithm that explores documents, analyzes the words in the original documents, discovers latent topics in them and their interrelations, finds the main themes prevalent in the documents, and organizes the original texts according to the discovered thematic structure. With the growth of information, and the increasing requirement to retrieve it efficiently, the field of topic modelling has developed a variety of algorithms. At the base of all advanced topic modelling lies LDA (Latent Dirichlet Allocation).

Figure 2: Topic Modelling

LDA is a statistical model that tries to find the latent intuition portrayed in the documents. It is a generative process for finding the topics hidden in documents. Topics are defined as distributions over a fixed set of words, the vocabulary created from all the documents in consideration. The model assumes that this vocabulary exists prior to the process of finding hidden topics and that every document contains multiple hidden topics. It performs a two-step process: (i) initialize a random distribution over topics, and (ii) for every word position in the documents, randomly assign a topic as obtained from step (i) and randomly select a word from the vocabulary corresponding to that topic [3]. The main objective of topic modelling is to find the topics given a collection of documents (a corpus). Every original document of the corpus has its own hidden structure of topic proportions, which topic modelling (LDA) tries to dig out. LDA performs the analysis using the joint, or posterior, probability, which allows computing the distribution of hidden variables given the observed variables: the words in the documents constitute the observed variables, and the latent topic structure constitutes the hidden variables. A few terms used in LDA are: (i) β, the distribution of each topic over the fixed vocabulary; (ii) θ, the proportions of topics in a document, where θij gives the proportion of the ith topic in the jth document; and (iii) znj, the assignment of the nth word of the jth document to a particular topic.

In reality, the number of topic structures possibly hidden in the documents can be exponentially large. For this reason, several approximating algorithms have been developed that consider alternative distributions of words over the hidden topic structure. One commonly used sampling-based algorithm is Gibbs sampling. In Gibbs sampling, a sequence of random variables is generated, each dependent on the previous variable in the sequence; the sequence generated is known as a Markov chain. This enables the algorithm to approximate the distribution of topics by collecting samples from the limited distribution of words. In this project, the advanced topic modelling algorithm HDP (Hierarchical Dirichlet Process) is used, which builds on both Gibbs sampling and LDA.


Working principle of HDP: The Hierarchical Dirichlet Process is a non-parametric Bayesian clustering algorithm. It aims to cluster groups of data possessing a mixture model. It works on the assumption that certain components or features are shared across multiple dependent data groups, from which it tries to infer the shared properties effectively and generalize to new group models. The algorithm assumes that each data group consists of n data points with mixed properties. When different data groups exhibit varied features in various proportions, the same combination of mixture proportions can be used to discover statistical strength shared across multiple groups and to generalize to new groups. It considers two important parameters: (i) the base distribution, which serves as the prior distribution over data items, and (ii) the concentration parameters, which govern the number of clusters and the amount of sharing among the groups. This approach proves very effective in topic modelling. There, a data group is analogous to a document consisting of a distribution of words: N data groups, or documents, consist of M data points, or words, each. A document exhibits a mixture of features, i.e., it contains various topics in mixed proportions; topics correspond to the clusters. Therefore, HDP explores the features shared among the various data groups and generalizes to new topics, unbounded in number, given the set of documents. In this system, the HDP topic modelling approach is used to look for relevant APIs, given a mashup description, based on API functionality.

Algorithm [12]:

1. Appropriately initialize a non-parametric prior; this is referred to as the Hierarchical Dirichlet Process.

2. This prior is gradually used in the grouping of the mixture models as stated above.

3. θ and β are defined on a measurable space.

4. A set of random probability measures is defined over the measurable space.

5. A random probability measure, termed Gj, is assigned to each mixture-model group.

6. G0 is the defined global probability measure. It is distributed with concentration parameter γ and base measure H.

7. Gj is conditionally independent given G0.

8. The global measure varies around the base probability measure H, with the deviation ruled by the concentration parameter.

9. The actual distribution Gj of a group varies from the global measure G0; this is governed by the concentration parameter.

10. The deviation of each group's distribution is represented by αj.

11. Each group is assigned factors θij, each corresponding to a single observation xij.

Figure 3: Workflow of the proposed model

This process extends easily to multiple levels. The output takes the form of a tree in which each node is assigned a Dirichlet probability, and the children of a node are conditionally independent given their parent.

Incorporating it in the system: Using HDP, the topic distributions of the APIs and mashups are computed. HDP consists of a training and a testing phase. The model is trained using the names and descriptions of the APIs; after training, the topic distributions of the APIs are inferred. In the testing phase, the trained model is used to predict the topic distributions of the mashups, again using their names and descriptions.
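A minimal sketch of this train-then-infer flow using gensim's HdpModel (the project itself uses the C++ HDP package described in Section 4; the toy documents here are illustrative):

    from gensim.corpora import Dictionary
    from gensim.models import HdpModel

    # Toy API descriptions (tokenized); the real model is trained on the
    # names and descriptions of all APIs in the dataset.
    api_docs = [["map", "geocoding", "tiles"],
                ["photo", "upload", "share"],
                ["housing", "listing", "search", "map"]]

    dictionary = Dictionary(api_docs)
    corpus = [dictionary.doc2bow(doc) for doc in api_docs]
    hdp = HdpModel(corpus, id2word=dictionary)  # number of topics is inferred

    # Inference: topic distribution of an unseen mashup description.
    mashup_doc = ["map", "search", "housing"]
    print(hdp[dictionary.doc2bow(mashup_doc)])  # [(topic_id, probability), ...]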

3.2 Phase 2: Collaborative Filtering: Exploiting API usage history

Using only API names and descriptions in topic modelling does not tend to give good results, as the descriptions are not always rich enough. For this reason, the usage history of APIs in existing mashups needs to be exploited as well; this is achieved by collaborative filtering.

Collaborative Filtering: Collaborative filtering is a widely used recommendation algorithm. It aims to predict and provide a recommendation for an item or product based on the ratings existing users have given to similar products. Based on the profiles of the current users in the system, an active user who shares a similar profile is recommended products of choice similar to those of the matching users. There are many variations of collaborative filtering; the majority first predict the user's preferences and then produce recommendations by ranking items by predicted preference. The baseline prediction mechanism aggregates the existing users' ratings of a product and computes their mean. Collaborative filtering captures the users' and products' ratings in a matrix called the ratings matrix. Among the various collaborative filtering techniques available, a few are: (i) user-user collaborative filtering, where the rating of an unrated product is predicted for the current user based on the profiles and past likings of other users similar to the active user; this method suffers from scalability problems as the user space grows. (ii) Item-item collaborative filtering, introduced to overcome that scalability problem: instead of user similarities, the similarities between products are considered. If a particular user likes multiple products of a similar type, the same user will likely like a new product of that type. For all these basic collaborative filtering methods, probabilistic models have been formulated to compute the probability of a user's fondness for purchasing a new item, or the probability distribution over a user's rating of an item. Collaborative filtering uses two families of methods: neighborhood (memory) based and model based. In the memory-based method, the entire user-item matrix is consumed in order to predict: statistical methods find a nearest set of neighbors, whose profiles are used to predict recommendations for the active or target user. This technique becomes ineffective as the dataset grows and becomes sparse. In the model-based approach, a model of users' ratings is first generated using machine learning mechanisms like Bayesian networks or clustering. It builds on matrix factorization, which treats the user-item ratings as a matrix and aims to discover the latent features hidden in the other users' ratings to aid in making predictions. Many probabilistic models have been generated that connect hidden variables directly to the user ratings. Matrix factorization also relies on Singular Value Decomposition (SVD): it finds the factorization R = U^T V such that the sum-squared distance to the target matrix R is minimized. Real-world data can be sparse, and in many cases entries of R are missing, so the sum-squared distance is considered only over the existing entries of the target matrix R. For this system, Probabilistic Matrix Factorization is used. It takes a Bayesian perspective and uses Gaussian distributions, and it overcomes drawbacks of existing matrix factorization such as poor robustness and sensitivity to noise or missing entries.

Algorithm [1] (a condensed sketch follows the list):

1. Initialize the user and item feature vectors to random values.

2. Iterate n times to get better averaged results.

3. Divide the data set into multiple batches.

4. Compute the prediction as a weighted average with variable factor lambda, using the Bayesian distribution.

5. Compute Gradients.

6. Update the training result vector based on the prediction and gradient.

7. Compute the prediction after the result update and calculate the RMSE (root mean square error).

8. As output, the item and user feature matrices are generated.
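A condensed NumPy sketch of this loop (toy dimensions, learning rate, and mask are illustrative; the project itself uses the Matlab package [1]):

    import numpy as np

    def pmf(ratings, mask, k=5, lam=0.1, lr=0.01, epochs=200, seed=0):
        """Fit R ~ U V^T by gradient descent on the observed entries
        (mask == 1); lam plays the role of the regularization factor lambda."""
        rng = np.random.default_rng(seed)
        n_mashups, n_apis = ratings.shape
        U = 0.1 * rng.standard_normal((n_mashups, k))
        V = 0.1 * rng.standard_normal((n_apis, k))
        for _ in range(epochs):
            err = mask * (ratings - U @ V.T)   # error on observed entries only
            grad_U = err @ V - lam * U         # gradient for mashup factors
            grad_V = err.T @ U - lam * V       # gradient for API factors
            U += lr * grad_U
            V += lr * grad_V
        err = mask * (ratings - U @ V.T)
        rmse = np.sqrt((err ** 2).sum() / mask.sum())
        return U, V, rmse

    # Toy mashup-by-API usage matrix: 1 where a mashup uses an API.
    R = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]], dtype=float)
    M = np.ones_like(R)                        # treat all entries as observed
    U, V, rmse = pmf(R, M)
    print(np.round(U @ V.T, 2), rmse)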

Incorporating it in the system: To incorporate this method, mashups are treated as users and APIs are treated as items. Collaborative filtering has the cold start problem: the system aims to find the list of relevant APIs based on the user's mashup input, but this mashup does not yet exist in the system, making it difficult to form the user-item matrix. This system overcomes the cold start problem by using topic modelling: based on the user's input mashup description, similar mashups already in the system are obtained by inferring the topic distribution of the user input from the trained HDP model and computing its similarity with the topic distributions of the existing mashups.

3.3 Phase 3: Popularity-based API ranking

As described above, we obtain the relevant APIs from the Hierarchical Dirichlet Process using cosine similarity with the mashup; this leverages the functionality of the APIs in providing recommendations. The output of this model is a set of APIs with probabilistic scores: the greater the score, the more relevant the API. Similarly, to leverage the historical usage of APIs in mashups, we use collaborative filtering with probabilistic matrix factorization; its output gives the relevant APIs for a mashup, again with probabilistic scores. The outputs obtained from these two components of the system can be treated as scores in the range [0, 1]. To generate the final list of relevant APIs from these two systems, we apply Bayes' theorem. The probability of an API being selected by HDP, given a mashup, can be written p(at|m); the corresponding probability from collaborative filtering is p(au|m). Assuming conditional independence, we can say:

p(a|m) = p(at|m) × p(au|m)

But what we want to compute is, given an API, how likely it is to be used in a given mashup. This is the posterior probability p(m|a). Applying Bayes' theorem, we can compute it (up to a normalizing constant) as:

p(m|a) ∝ p(at, au|m) × p(m) = p(at|m) × p(au|m) × p(m)

where p(m) is the prior probability of observing the mashup [10]. This value can be taken as 1/M, M being the total number of mashups used in the experiment.

The output of Bayes' theorem may contain APIs whose scores are identical and which offer identical functionality as well. To handle such scenarios, we also rank the APIs by popularity. This non-functional measure of popularity can be termed QoS (Quality of Service). Popularity is defined as the frequency with which an API has been used in any of the mashups present in the database: the more mashups an API has been used in, the better its rank. Following all these steps, the proposed system is able to recommend the top-k relevant APIs for mashup creation based on both functional and non-functional requirements.
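Putting the combination and ranking steps into code (a sketch; the scores, API names, and usage sets are illustrative, and the prior 1/7391 just reflects the dataset size reported in Section 4.1):

    from collections import Counter

    def combine_and_rank(hdp_scores, cf_scores, mashup_api_usage, prior, top_k=5):
        """Combine p(at|m) and p(au|m) via the product rule above, then
        break ties by popularity (usage frequency across existing mashups)."""
        popularity = Counter(api for apis in mashup_api_usage for api in apis)
        combined = {}
        for api in set(hdp_scores) | set(cf_scores):
            # p(at|m) * p(au|m) * p(m); a missing score defaults to 0.
            combined[api] = (hdp_scores.get(api, 0.0)
                             * cf_scores.get(api, 0.0) * prior)
        ranked = sorted(combined,
                        key=lambda a: (combined[a], popularity[a]),
                        reverse=True)
        return ranked[:top_k]

    hdp_scores = {"GoogleMaps": 0.9, "Flickr": 0.4, "Craigslist": 0.9}
    cf_scores = {"GoogleMaps": 0.8, "Craigslist": 0.8, "Twitter": 0.6}
    usage = [{"GoogleMaps", "Craigslist"}, {"GoogleMaps", "Flickr"}]
    print(combine_and_rank(hdp_scores, cf_scores, usage, prior=1 / 7391))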

4. IMPLEMENTATION

4.1 Data Preprocessing

The data used in this project was taken from ProgrammableWeb, collected in summer 2014 using the ProgrammableWeb APIs. The dataset consists of 11,199 APIs and 7,391 mashups. The data is available as a raw text file in which each line corresponds to one API or mashup.

Figure 4: Data pre-processing steps

Preparing the HDP input: Only the data corresponding to the attributes 'description' and 'name' were required: those of the APIs for building the trained model, and those of the mashups for inference from the trained model. As part of preprocessing, the description value was retrieved from the raw text file and unwanted characters were erased from it (punctuation such as ,.;:=+?@%$*|)([] and the digits 0-9). Next, the stop-words were removed from the description. Stop words are words that occur very frequently in a language and need to be filtered out before natural language processing (e.g., as, and, the, was, are). The Python pyenchant module was used to validate the words against an English dictionary. Finally, the POSTagger tool, a natural language processing tool that tags individual words as noun, adverb, adjective, verb, preposition, and so on, was used. For the purpose of building a meaningful vocabulary, only the nouns were considered.
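A sketch of this cleaning pipeline in Python using pyenchant and NLTK (the project used the POSTagger tool for tagging; nltk.pos_tag stands in for it here):

    import re
    import enchant
    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    english = enchant.Dict("en_US")
    stops = set(stopwords.words("english"))

    def clean_description(text):
        """Strip unwanted characters, drop stop-words and non-dictionary
        words, and keep only nouns for the vocabulary."""
        text = re.sub(r"[^a-zA-Z\s]", " ", text.lower())  # punctuation, digits
        tokens = [t for t in text.split()
                  if t not in stops and english.check(t)]
        # Keep nouns only (NN, NNS, NNP, NNPS tags).
        return [w for w, tag in nltk.pos_tag(tokens) if tag.startswith("NN")]

    print(clean_description("An API for searching housing listings on maps!"))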

4.2 Building the HDP Model

The algorithm requires input data in a particular format to train the model: [number of unique terms in an API] [index of first word]:[frequency of first word] [index of second word]:[frequency of second word] ... [index of nth word]:[frequency of nth word]. Thus the number of lines in the input file equals the number of documents, or APIs, in our dataset. The index of each word was obtained from the vocabulary generated from the descriptions of all the APIs. Java code was written to generate the vocabulary from the APIs' descriptions and to build the input file.

1. Environment setup: The HDP algorithm package is built in C++ and depends on the GSL (GNU Scientific Library) package. To configure and run the algorithm, a virtual machine was set up running Ubuntu 14.0, and the GSL package was installed. Necessary changes were also made to the library paths and environment variables.

2. Training the model: The vocabulary file and the input file generated as described above were kept in the working directory, and a directory was created to store the output results. Command to build a training model: hdp --algorithm train --data <input data> --directory <result-directory>. Several other parameters can be passed with the command to tweak the maximum number of iterations and the initial number of topics the algorithm considers when it starts building the model.

3. Output obtained: Once the model has been trained, it generates several files in the result directory. The following two files were studied to get insight into the result and the performance of the HDP algorithm. The mode-topics file contains the topics generated and the frequency of each word from the vocabulary; each line corresponds to a single topic: [frequency of the 1st vocabulary word in topic i] [frequency of the 2nd vocabulary word in topic i] ... [frequency of the nth vocabulary word in topic i], where i ranges over the topics and n over the words in the vocabulary. For example, with 2 topics generated from a vocabulary of 4 words:

0001 0030 0060 0012
0023 0032 0012 0007

This illustrates that 2 topics were generated from the entire corpus of documents (APIs). Topic 0 contains 1 occurrence of the 1st vocabulary word, 30 occurrences of the 2nd, 60 of the 3rd, and 12 of the 4th; the second line gives the distribution of the words in topic 1 in the same way. The mode-word-assignments file contains the assignment of every word of every document to the topics generated. The file stores a matrix of rows [d w t], where d is the document id (ranging from 0 to one less than the number of APIs), w is the index of the word in the vocabulary as present in the corresponding document, and t is the topic id (ranging from 0 to one less than the number of topics formed while training the model).

4. Visualization of the output: The topics generated while building the training model can be visualized. An R script was run to generate the list of words that appear most frequently in each of the topics. Command to run the script: ./printTopics.R mode-topics.dat <vocabulary file> <output file>

5. Testing the trained model: After the training model was built, it was tested with the user input of a mashup description and also with the existing mashups. This inferred the topic distribution of the mashups and the user input. To run the trained model on the user input, the command used was: ./hdp --algorithm test --data <user test input-data> --saved-model <binary of the trained model generated> --directory <test result-directory>. Output files were generated in the same manner as during training.

6. Computing the cosine similarity: Once we obtained the word-topic distribution of the APIs, the training model was used to infer the topic distribution of the existing mashups. This helps to find the relevant mashups based on the user input and then look for the relevant APIs. The similarity measure used is cosine similarity. It is first computed between the user-input topic distribution and each mashup distribution; the mashups showing similarity above 0.7 were considered. With these candidate mashups, the similarity was computed with every existing API, and the APIs showing similarity above 0.7 formed the candidate list from the HDP phase. Cosine similarity between two documents is computed with Java code that takes vectors as input; the value ranges from 0 to 1, with 1 being completely similar. A minimal version appears after this list.
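The per-document topic distributions referenced in step 3 can be derived from the [d w t] rows of the mode-word-assignment file by counting and normalizing. A minimal sketch, assuming one triple per line with no header; the file name and dimensions are supplied by the caller:

    import java.io.*;
    import java.util.*;

    // Builds a topic distribution per document from word-topic assignments.
    public class TopicDistribution {
        public static double[][] fromWordAssignments(String file, int numDocs, int numTopics)
                throws IOException {
            double[][] dist = new double[numDocs][numTopics];
            try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
                String row;
                while ((row = reader.readLine()) != null) {
                    String[] parts = row.trim().split("\\s+");
                    int d = Integer.parseInt(parts[0]);   // document (API) id
                    int t = Integer.parseInt(parts[2]);   // topic id; parts[1] is the word index
                    dist[d][t] += 1.0;                    // count words assigned to topic t in doc d
                }
            }
            // Normalize each row so it sums to 1, giving a topic distribution.
            for (double[] doc : dist) {
                double total = Arrays.stream(doc).sum();
                if (total > 0) {
                    for (int t = 0; t < doc.length; t++) doc[t] /= total;
                }
            }
            return dist;
        }
    }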
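And a minimal version of the cosine similarity computation from step 6; since the input vectors are topic distributions (non-negative), the result lies in [0, 1]:

    // Cosine similarity between two topic-distribution vectors.
    public class CosineSimilarity {
        public static double cosine(double[] a, double[] b) {
            double dot = 0, normA = 0, normB = 0;
            for (int i = 0; i < a.length; i++) {
                dot += a[i] * b[i];
                normA += a[i] * a[i];
                normB += b[i] * b[i];
            }
            if (normA == 0 || normB == 0) return 0;  // guard against empty vectors
            return dot / (Math.sqrt(normA) * Math.sqrt(normB));
        }

        public static void main(String[] args) {
            double[] userInput = {0.6, 0.3, 0.1};
            double[] mashup    = {0.5, 0.4, 0.1};
            System.out.println(cosine(userInput, mashup));  // ~0.98: very similar
            // Mashups (and later APIs) scoring above 0.7 are kept as candidates.
        }
    }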

4.3 Performing Collaborative Filtering

The Matlab package [1] for probabilistic matrix factorization was used. For this experiment, the candidate mashups considered were the ones that use at least 4 APIs, which reduced our dataset for collaborative filtering to 909 mashups with 852 APIs. The algorithm takes its input as a matrix in which every row holds a mashup id and an API id together with a static value; Java code was written to formulate this input matrix. The code was run for various numbers of epochs and the RMSE (Root Mean Square Error) was calculated. The model giving the minimum RMSE was used to generate the API feature matrix and the mashup feature matrix. Java code was then written to process these two matrices and generate the predicted rating matrix, which helped in finding the APIs most used in mashup creation. Each value in the matrix is a probabilistic value in [0, 1]; to find the relevant APIs for a particular mashup, the ones with values above 0.7 were considered.
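A sketch of this reconstruction step, assuming the mashup feature matrix U and the API feature matrix V each hold one latent feature vector per row, with a logistic function squashing the raw dot product into [0, 1] (the exact mapping in the Matlab package may differ):

    // Reconstructs the predicted rating matrix from the two PMF factor matrices.
    public class PmfRatings {
        public static double[][] predictedRatings(double[][] U, double[][] V) {
            int mashups = U.length, apis = V.length, k = U[0].length;
            double[][] ratings = new double[mashups][apis];
            for (int m = 0; m < mashups; m++) {
                for (int a = 0; a < apis; a++) {
                    double dot = 0;
                    for (int f = 0; f < k; f++) dot += U[m][f] * V[a][f];
                    // Map the raw dot product into [0, 1] via the logistic function.
                    ratings[m][a] = 1.0 / (1.0 + Math.exp(-dot));
                }
            }
            return ratings;
        }

        public static void main(String[] args) {
            double[][] U = {{0.9, 1.2}};               // one mashup, 2 latent features
            double[][] V = {{1.1, 0.8}, {-0.5, 0.2}};  // two APIs
            double[][] r = predictedRatings(U, V);
            System.out.printf("API 0: %.2f, API 1: %.2f%n", r[0][0], r[0][1]);
            // ~0.88 and ~0.45: only APIs rated above 0.7 are kept as candidates.
        }
    }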

4.4 Bayes Theorem and Popularity-Based API Ranking

Based on the candidate lists of APIs obtained from HDP and collaborative filtering, Bayes' theorem was applied to get a combined list. The candidate lists represent each API with a score, a probabilistic value in [0, 1]. Using these scores in the formula below, a combined score for every API was obtained; the formula assumes that the two sources of evidence are conditionally independent given the mashup. Finally the APIs were ranked by popularity, measured as the frequency with which an API has been used in the existing mashups.

p(a_t, a_u | m) × p(m) = p(a_t | m) × p(a_u | m) × p(m)

where a_t denotes the topic-based (HDP) evidence for an API and a_u its usage-based (collaborative filtering) evidence.
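A sketch of the combination and ranking, assuming per the formula that the two scores are conditionally independent given the mashup; all names, scores, usage counts, and the prior p(m) below are illustrative:

    import java.util.*;

    // Combines the HDP and collaborative-filtering scores, then ranks by popularity.
    public class ScoreCombiner {
        public static void main(String[] args) {
            Map<String, Double> hdpScore = Map.of("GoogleMaps", 0.85, "YahooMaps", 0.80);
            Map<String, Double> cfScore  = Map.of("GoogleMaps", 0.90, "YahooMaps", 0.72);
            Map<String, Integer> usageCount = Map.of("GoogleMaps", 2300, "YahooMaps", 150);
            double priorM = 0.1;  // p(m), identical for all APIs in this sketch

            // Combined score per API: p(a_t|m) * p(a_u|m) * p(m).
            Map<String, Double> combined = new HashMap<>();
            List<String> ranked = new ArrayList<>(hdpScore.keySet());
            for (String api : ranked) {
                combined.put(api, hdpScore.get(api) * cfScore.get(api) * priorM);
            }
            // Final ranking by popularity (usage frequency in existing mashups).
            ranked.sort(Comparator.comparing(usageCount::get).reversed());
            ranked.forEach(api -> System.out.println(api + " -> " + combined.get(api)));
        }
    }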

5. RESULTS

The HDP model applied to the names and descriptions of the APIs yielded 73 topics. Below is the visualization of the topic probability distribution across a few documents.

Below are a few of the topics with the words that appear most frequently in them. Different colors show the different frequencies of the words in a topic; the frequency of the words increases from right to left.

As a result of collaborative filtering, it was found that Google Maps, YouTube, and Facebook were among the APIs used by the majority of the mashups.

6. EVALUATION

To evaluate the system, we divided the current mashups present in the system into a training set and a testing set, with a split ratio of 8:2. For the evaluation metrics, we used recall, precision, and F-score.

Figure 5: Topic distribution on the API's name and description

Figure 6: Word-topic distribution over the API's name and description


Figure 7: Recall with different numbers of APIs recommended

Recall can be defined as the ratio of the number of correctly recommended APIs to the total number of relevant APIs.

Recall = |Relevant APIs ∩ Recommended APIs| / |Relevant APIs|

Precision can be defined as the ratio of the number of correctly recommended APIs to the total number of recommended APIs.

Precision = |Relevant APIs ∩ Recommended APIs| / |Recommended APIs|

F-score considers both recall and precision, combining them as their harmonic mean.

F-Score = (2 × Recall × Precision) / (Recall + Precision)
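A small worked computation of the three metrics on an illustrative top-k list (the API names and sets below are made up):

    import java.util.*;

    // Computes recall, precision, and F-score from relevant/recommended sets.
    public class Metrics {
        public static void main(String[] args) {
            Set<String> relevant = new HashSet<>(Arrays.asList("GoogleMaps", "YouTube", "Flickr"));
            Set<String> recommended = new HashSet<>(Arrays.asList("GoogleMaps", "YouTube", "Twitter", "Facebook"));

            Set<String> hits = new HashSet<>(relevant);
            hits.retainAll(recommended);  // |Relevant ∩ Recommended| = 2

            double recall = (double) hits.size() / relevant.size();        // 2/3
            double precision = (double) hits.size() / recommended.size();  // 2/4
            double fScore = 2 * recall * precision / (recall + precision); // ~0.57
            System.out.printf("recall=%.2f precision=%.2f f-score=%.2f%n", recall, precision, fScore);
        }
    }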

For the computation of these evaluation metrics, we used a total of 300 mashups and computed the average values. The computation was carried out across different values of k ranging from 10 to 100 at intervals of 10, which gives the recall and precision when only the top-k relevant APIs are needed. The system was then compared with recommendation systems based only on HDP and only on collaborative filtering using probabilistic matrix factorization. We also compared this work with the previously published work, alongside a system that used LDA and traditional collaborative filtering instead. The figures below show the corresponding graphs.

From the graphs we see that HDP alone does not perform very well, because it only considers the functional requirement without taking into account the usage history of the mashups. On the other hand, we see a little improvement with collaborative filtering using matrix factorization, as it considers the usage history of the mashups, but it suffers from the cold start problem. The hybrid system provides a better recommendation, and adding the popularity-based QoS factor refines it further. This work has been compared with the two other works done in [6] and [8]. Similar metrics to those in [8] have been used in evaluating this system, to ensure a fair comparison.

Figure 8: Precision with different numbers of APIs recommended

Figure 9: F-score with different numbers of APIs recommended


Figure 10: Comparison of the system's performance with the published work

The table below shows the comparison of recall and average precision across all three systems.

7. CONCLUSION

It can be concluded that the proposed approach yields better results than the traditional methods of topic modelling and collaborative filtering. In this paper, newer, more advanced algorithms for topic modelling and collaborative filtering were tried. The approach overcomes many problems of the traditional process and yielded a better result in recommending APIs for mashups.

8. FUTURE WORK

Future work involves improving the performance of the system through other advanced algorithms. This can also be done by careful study of the data attributes: as only the 'name' and 'description' attributes have been considered in the current approach, the system can be extended to use the category, tags, ratings, and other attributes present in the dataset. Using these attributes may reveal interesting relationships. Further, the concept of web composability may be leveraged to increase the effectiveness of the system. A simple example: Google Maps and Yahoo Maps cater to almost the same functionality, but each differs in its own way. When building a mashup that requires a map API, the Google Maps API is more likely to be recommended because its usage history in the existing mashups is larger, yet sometimes the Yahoo Maps API may best suit the user's requirement. This issue can be resolved using the idea of web composability.

9. ACKNOWLEDGEMENT

Thank you to Dr. Xumin Liu, because of whom this project exists today and who has helped me enrich my knowledge in this domain. Thank you to my friends and family, because of whom I work harder every day, and for their continuous support in hard times. Thank you to the RIT CS Department, because of which I have my Master's degree.

10. REFERENCES

[1] Probabilistic matrix factorization. https://pymc-devs.github.io/pymc3/pmf-pymc/.

[2] A. Smith, T. Hawes, and M. Myers. Interactive visualization for hierarchical topic models.

[3] D. M. Blei. Hierarchical Dirichlet process (with split-merge operations). https://github.com/blei-lab/hdp.

[4] D. M. Blei, T. L. Griffiths, M. I. Jordan, and J. B. Tenenbaum. Hierarchical topic models and the nested Chinese restaurant process.

[5] B. Cao, J. Liu, M. Tang, Z. Zheng, and G. Wang. Mashup service recommendation based on user interest and social network. In Web Services (ICWS), 2013 IEEE 20th International Conference on, pages 99–106, June 2013.

[6] A. Jain, X. Liu, and Q. Yu. Aggregating functionality, use history, and popularity of APIs to recommend mashup creation. In 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 188–202. Springer Berlin Heidelberg.

[7] N. Lathia, S. Hailes, and L. Capra. Temporal collaborative filtering with adaptive neighbourhoods.

[8] C. Li, R. Zhang, J. Huai, and H. Sun. A novel approach for API recommendation in mashup development. In Web Services (ICWS), 2014 IEEE International Conference on, pages 289–296, June 2014.

[9] X. Liu, Q. Zhao, G. Huang, H. Mei, and T. Teng. Composing data-driven service mashups with tag-based semantic annotations. In Web Services (ICWS), 2011 IEEE International Conference on, pages 243–250, July 2011.

[10] J.-G. Lou, Q. Fu, S. Yang, J. Li, and B. Wu. Mining program workflow from interleaved traces. In 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 613–622, July 2010.

[11] A. Maaradji, H. Hacid, R. Skraba, A. Lateef, J. Daigremont, and N. Crespi. Social-based web services discovery and composition for step-by-step mashup completion. In Web Services (ICWS), 2011 IEEE International Conference on, pages 700–701, July 2011.

[12] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes.

[13] Q. Wang, D. Zhang, and L. Si. Semantic hashing using tags and topic modeling. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '13, pages 213–222, New York, NY, USA, 2013. ACM.

[14] B. Xia, Y. Fan, W. Tan, K. Huang, J. Zhang, and C. Wu. Category-aware API clustering and distributed recommendation for automatic mashup creation. IEEE Transactions on Services Computing, 8(5):674–687, Sept 2015.