Using Full-text Academic Articles and Wikipedia to Find ...
Data and Tools
Abstract
Contact Information
Shutian Ma: [email protected]
Chengzhi Zhang: [email protected]
Using Full-text Academic Articles and Wikipedia to Find Alternative Free Bioinformatics Software
Shutian Ma; Chengzhi Zhang
Department of Information Management, Nanjing University of Science and Technology, Nanjing, China, 210094
Experimental Result
- Scientific literature
  114,510 papers in XML format - PLOS ONE
  11,013 articles containing bioinformatics software
- Software list
  143 specific bioinformatics software - Wiki
  20 commercial software according to their licenses
  Only 97 software have infobox information
- Tools
  LSI and Doc2Vec - Gensim
  Python package of LDA
  Node2Vec - OpenNE
  Differ2Vec and struc2vec - GitHub
Taking bioinformatics software as a case study, this paper aims to find free software that is similar to commercial software and has the potential to serve as an alternative. Content and network information are applied to obtain preference-oriented results, which capture similarity in how people describe software in Wikipedia and how people use it in research.
What do we want to do? Find Free Software!
Method
Represent Software Using Full-text and Wiki
Compute Similarity between Free and Commercial Software
Construct Ground Truth and Evaluate
Recommendation Result
Categories whose names start with "List of" are excluded.
Find Bioinformatics Software Information that Tells Us Which is Free/Commercial
Information in the Wiki infobox is converted into a “Knowledge Graph”
Use the number of returned results to calculate normalized pointwise mutual information (NPMI)
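The NPMI step can be sketched as follows. The poster does not say which search service supplies the result counts or what total is used for normalization, so the function name, its parameters, and the `total` argument are assumptions made only for illustration. The sketch uses the standard definition, NPMI(x, y) = PMI(x, y) / (-log p(x, y)), with probabilities estimated as counts divided by the total.

```python
import math


def npmi(hits_x, hits_y, hits_xy, total):
    """Normalized pointwise mutual information from result counts.

    hits_x, hits_y: number of results returned for each term alone.
    hits_xy: number of results returned for both terms together.
    total: assumed size of the searched collection (hypothetical parameter).
    Returns a value in [-1, 1].
    """
    if hits_xy == 0:
        return -1.0  # the terms never co-occur
    p_x = hits_x / total
    p_y = hits_y / total
    p_xy = hits_xy / total
    if p_xy == 1.0:
        return 1.0  # both terms appear in every result; avoid dividing by zero
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)
```

A value near 1 means the two names almost always appear together, a value near 0 means they co-occur about as often as chance would predict, and -1 means they never co-occur.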
Paper-based Software Recommendation Performance
Wiki-based Software Recommendation Performance
Conclusion
☑ Use Wiki content
☐ Use full-text paper content
☑ Wiki content
☐ Wiki graph
☑ Combine Wiki content & graph
☐ Wiki content
☐ Wiki graph
☑ Graph embedding can help to improve paper-based recommendation.
☑ When all information is combined, recommendation performance does not improve as much as expected.
✓ Wikipedia content and infobox information can be balanced together for an efficient software recommendation technique.
✓ Graph-based information can help to enrich semantic information.
✓ This kind of full-text publication dataset is not suitable for representing entities such as software, or other entities, in research.
When doing linear combinations, all weights are set to one.
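A minimal sketch of such a linear combination, assuming each information source (for example Wiki content and Wiki graph) has already produced a similarity score per candidate software. The function name and the dictionary layout are hypothetical; only the unit weights come from the poster.

```python
def combine_scores(score_sets, weights=None):
    """Linearly combine per-candidate similarity scores from several sources.

    score_sets: list of dicts mapping candidate name -> similarity score.
    weights: one weight per source; defaults to 1.0 for every source,
             matching the unit weights stated on the poster.
    """
    if weights is None:
        weights = [1.0] * len(score_sets)
    combined = {}
    for weight, scores in zip(weights, score_sets):
        for name, score in scores.items():
            combined[name] = combined.get(name, 0.0) + weight * score
    return combined
```

With unit weights the combination reduces to a plain sum, so a source whose scores span a wider range will dominate unless the scores are normalized first.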
- Software Representation Generation
  LSI, LDA and Doc2Vec - represent software via vectors based on content, in 100 dimensions
  Node2Vec, Differ2Vec and struc2vec - represent software via vectors based on graph, in 128 dimensions
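Once each software is represented as a vector, free candidates can be ranked against a commercial one. The poster does not name the similarity measure, so cosine similarity is an assumption here, and the function names and the toy vectors in the usage note are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_u * norm_v)


def rank_free_alternatives(commercial_vec, free_vecs, top_k=3):
    """Return the top_k (name, score) pairs of free software most similar
    to the given commercial software vector, best match first."""
    scored = [(name, cosine(commercial_vec, vec)) for name, vec in free_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

For example, `rank_free_alternatives([1.0, 0.1], {"tool_a": [1, 0], "tool_b": [0, 1]})` places `tool_a` first because its vector points in nearly the same direction as the query.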
- Software Graph Construction

Node Type                                       Value
Developed by what kind of team                  university, company and person
Year of stable release                          14 different years
Written in what kind of programming language    17 languages
Operating system                                Linux, Unix, Windows and Mac OS
Applied platform                                6 kinds
Available language                              English or cross-language
Software type                                   44 types
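One way to turn infobox fields of this kind into a graph is to link each software node to one node per attribute value, so that two software packages sharing a value become two hops apart. This is a sketch under assumptions: the poster does not describe the exact graph schema, and the function name, the `node_type:value` labelling, and the example entries are hypothetical.

```python
from collections import defaultdict


def build_infobox_graph(infoboxes):
    """Build an undirected graph as an adjacency mapping.

    infoboxes: dict mapping software name -> {node type: list of values}.
    Each value becomes an attribute node labelled "node type:value"
    (an assumed labelling scheme) connected to the software that has it.
    """
    graph = defaultdict(set)
    for software, fields in infoboxes.items():
        for node_type, values in fields.items():
            for value in values:
                attribute_node = f"{node_type}:{value}"
                graph[software].add(attribute_node)
                graph[attribute_node].add(software)
    return graph
```

The resulting adjacency mapping can be exported as an edge list, which is the input format that graph embedding tools typically accept.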