Journal of Software
Volume 13, Number 5, May 2018
Web Crawling and Processing with Limited Resources for Business Intelligence and Analytics Applications

Loredana M. Genovese, Filippo Geraci*
Institute for Informatics and Telematics, CNR, Via G. Moruzzi 1, Pisa, Italy.
* Corresponding author. Email: [email protected]

Submitted 2018; accepted March 8, 2018.
doi: 10.17706/jsw.13.5.300-316
Abstract: Business intelligence (BI) is the activity of extracting strategic information from big data. The benefits of this activity for enterprises span from the reduction of operating costs, due to a more sensible internal organization, to a more productive and aware decision process. To be effective, BI relies heavily on the availability of a huge amount of (possibly high-quality) data. The steady decrease of the costs for acquiring, storing and analyzing large knowledge bases has motivated big companies to invest in BI technologies. Until now, instead, SMEs (small and medium-sized enterprises) have been excluded from the benefits of BI because of their limited budget and resources. In this paper we show that a satisfactory BI activity is possible even in the presence of a small budget. Our ultimate goal is not necessarily that of proposing novel solutions but of providing practitioners with a sort of hitchhiker's guide to cost-effective web-based BI. In particular, we discuss how the Web can be used as a cheap yet reliable source of information where crawling, data cleaning and classification can be achieved using a limited amount of CPU, storage space and bandwidth.

Keywords: Big data analytics, business intelligence, spam detection, web classification, web crawling.
1. Introduction

Intelligence is the activity of learning information from data. The benefits of its application to business activities started to be discussed in the mid-twentieth century, when Hans Peter Luhn coined the term business intelligence (BI) in one of his papers [1]. In that paper Luhn described some guidelines and functions of a business intelligence system under development at the IBM labs.

In the course of the years, BI has evolved, extending outside the scope of the database community and increasing its complexity. Structured information has been replaced by unstructured one; data volumes got bigger, posing new challenges for gathering, storing and analyzing them; analytics (namely: the possibility of creating profiles from data) has emerged as a new opportunity to help companies drive their business efforts in the most promising directions. Forecasts on future trends [2] predict an ever increasing role of BI and big data analytics in the success of companies. However, the cost of implementing BI has excluded small and medium enterprises from the benefits of big data analysis. In fact, collecting and storing data is still considered expensive and computationally demanding. Although practical guides such as that in [3] provide wise suggestions to build scalable infrastructures that can optimize long-term costs, the initial effort required to set up a minimalistic BI system still remains unaffordable.

In this paper we deal with the problem of building a good quality business intelligence infrastructure while keeping an eye on the company budget. Achieving this goal requires a careful choice of the data sources, extraction procedures, and pre-processing. As other studies suggest, given the universal accessibility of
contents as well as its intrinsic multidisciplinarity, the Web can be considered the most convenient source of information. However, using the Web poses several issues in terms of data coherence, representativeness, and quality. Moreover, technical challenges have to be considered. In fact, although downloading a page seems to be free, crawling large portions of the web may quickly become expensive and computationally demanding. Seed selection is a key factor to ensure the data is up to date. For example, new initiatives can be promoted by registering ad-hoc websites that may not be reachable yet from other websites but are spread through other channels such as social media. Download ordering can influence representativeness, while data filtering can bias coherence.

In Section 2 we discuss crawling strategies, tailored to the needs of small companies, that can drastically simplify the design of crawlers and data storage without sacrificing quality. In particular, analyzing the bottlenecks of standard crawling, we discovered that most of the CPU effort is due to the need to check whether a URL has already been discovered/downloaded or not. However, a more in-depth analysis shows that, leveraging an ad-hoc download ordering, these checks can be reduced in the specific context of limited resources. This, in turn, has the effect of simplifying the storage architecture and providing data stream access for the subsequent analyses.

In Section 3 we try to answer the question of whether web data is suitable for BI or not. Statistical analyses in the literature show that, although morphologically different, web corpora have the potential to be good knowledge bases once noise is filtered. We pinpoint spam and web templates as the two main sources of noise. Besides data coherence, spam has a profound impact on the consumption of bandwidth and storage resources. In order to limit resource waste, we show how to exploit DNS (Domain Name System) queries to predict whether a newly discovered domain is likely to be spam and, in that case, remove it from the list of websites pending download.

In Section 4 we discuss the problem of per-topic classification, comparing post-processing with focused crawling. In the absence of resource limits the former solution would be preferable because it prevents a useful document linked by an off-topic website from being lost. However, in a real scenario the higher cost of post-processing compared to its benefits makes focused crawling more feasible. In either case, however, dealing with classification requires owning a large set of examples that, at least for specialized companies, can be hard to find in publicly available ontologies or annotated resources. Thus, we show how such a set of examples can be built using semi-supervised techniques that balance accuracy and human effort.

Summarizing: in this paper we investigate the technological and methodological challenges that IT practitioners have to deal with when building a BI architecture with a limited budget. We also propose solutions that can help to attain a satisfactory tradeoff between costs and BI quality.
2. Crawling

Crawling is the activity of creating a local copy of a relevant portion of the World Wide Web. This task usually represents the first challenge that IT professionals have to deal with when performing BI and Web analytics.

In its simplest form, a crawler is a program that receives as input a list of valid URLs (called seeds) and starts downloading and archiving the corresponding websites. Intra-domain hyperlinks are immediately followed in breadth-first order, while links to external resources are appended to the list of seeds. The crawling process terminates once the list of seeds becomes empty. Though simple in principle, the variety of purposes for which data is collected, as well as technical issues, has made crawling a rich and complex research field. As a measure of the interest in this topic, in a recent survey [4] the authors claim they catalogued 1488 articles about crawling. Restricting to real implementations, the authors listed 62 works. In the following sections we dissect the main crawling design strategies and propose cost-effective options tailored to the BI process.
2.1. Seed Selection

The seed list is the initial set of URLs (more often websites' home pages) taken in input by a crawler starting to build the web image. Despite not being discussed in depth in the scientific literature about general purpose crawling, how to build a comprehensive seed list is one of the most debated topics in specialized forums.

In [5] the authors provide a successful macroscopic description of the web that can be useful to speculate on the effects of seed selection. They start by recalling that the web can be modeled as a directed graph where pages are nodes and hyperlinks are edges. According to the experiments in [5], web pages can be divided into five categories that form the famous bow-tie shaped graph. The core of the web consists of the SCC: a strongly connected subgraph (i.e. a graph where, given any two random nodes, there exists a path connecting them). The SCC is linked from the class of IN pages and links to the OUT pages. The two other relevant classes are the TENDRILS (namely: pages not connected with the SCC) and the disconnected pages (i.e. pages with both in-degree and out-degree equal to 0).

Crawling the SCC and OUT components is relatively easy. In fact, given a single URL belonging to the SCC, it is theoretically possible to find a path that connects to all the other elements of the SCC and, in turn, to OUT. The elements of these categories are typically popular websites, thus finding pre-compiled lists of them on the web is a simple task.

Pages in IN and in TENDRILS, which often are new websites not yet discovered and linked, are very difficult or even impossible to discover. In fact, these pages either have in-degree equal to 0 or are linked from pages of the same category. According to the estimations in [5], IN and TENDRILS account for about half of the entire web. Thus, in order to crawl as many pages as possible of these two classes, their URLs should be present in the initial seed list.

Collaborative efforts have produced large lists of domains freely available on the web. Despite being useful to increase the coverage of the IN and TENDRILS classes, these resources cannot be kept updated because of the dynamicity of the web. Recently, harvesting URLs posted on Twitter has emerged as an effective alternative source of fresh and updated seeds. A first application of Twitter URLs to enhance web searching is shown in [6], the ability of this social media to immediately detect new trends is proven in [7], while a machine learning approach to identify malicious URLs in Twitter posts is described in [8].
2.2. URL Crawling Order

In the context of limited resources, the URL download order matters [9]. Pioneering papers like [10] indicate breadth-first as the strategy that provides better empirical guarantees to quickly download high quality (in terms of PageRank) pages. The intuition at the basis of this idea is that page quality is somehow related to the depth from the document root [11].

In [11] the authors provide a nice comparison of several ordering strategies. According to [11], the main qualifying feature to classify ordering algorithms is the amount of information they need, going from no pre-existing information to the request of historical information about the web-graph structure.

In the context of BI for small and medium enterprises, however, supposing the existence of historical information (i.e. a previous crawl) may be inappropriate. In fact, small companies are not supposed to be able to refresh their crawls too often. Thus, even if existing, historical information could be old enough to become misleading. Hence, ordering strategies that do not require pre-existing information should be preferred.

In [12] the authors propose a hybrid approach in which websites are kept sorted using a priority queue, where priorities are given by the number of URLs per host already discovered but whose download is still pending. Once a website is selected, a pre-defined number k of pages is downloaded. This strategy has
several advantages in terms of network usage. In fact, the number of DNS requests can be reduced thanks to caching, while the use of the keep-alive directive [13] reduces network latency. Experiments in [12] prove that this strategy still downloads high quality pages first.

Building upon [12], we propose an ordering strategy that can drastically simplify the crawler architecture in the practical case of limited resources. We exploit the important observation from [14], where the authors note that chains of dynamic pages can lead to an infinite website even though the real information stored in the web server is finite. According to this observation, setting a limit to the maximum number of pages per host appears sensible.

We thus keep the hosts sorted as in [12], but once a website is selected, the downloading process is done all at once. Some necessary caveats are discussed in the subsequent sections.
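A minimal sketch of this ordering policy, using Python's `heapq` with lazy deletion of stale entries (the class and method names are illustrative, not those of our implementation):

```python
import heapq

class HostQueue:
    """Hosts ordered by the number of pending URLs discovered for them.

    The host with the most pending URLs is served first, and the whole
    site is then downloaded at once, so the host leaves the queue.
    """
    def __init__(self):
        self._heap = []        # (-pending, host): negation gives max-priority
        self._pending = {}     # host -> current pending count

    def add_url(self, host):
        """Register one more discovered URL for `host`."""
        self._pending[host] = self._pending.get(host, 0) + 1
        heapq.heappush(self._heap, (-self._pending[host], host))

    def pop_host(self):
        """Return the host with highest priority, skipping stale heap entries."""
        while self._heap:
            neg, host = heapq.heappop(self._heap)
            if host in self._pending and -neg == self._pending[host]:
                del self._pending[host]   # the site is downloaded all at once
                return host
        return None
```

Lazy deletion keeps each priority update O(log n) without needing a decrease-key operation, which plain `heapq` does not provide.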
2.2.1. Politeness

Crawling etiquette would require not to affect web server performance by overloading it with download requests. According to the experiments in [12], a time interval of 15 seconds is appropriate for real life applications so as to balance politeness constraints with the timeout of the keep-alive directive. Moreover, using a time-series analysis of web server access logs as in [15], we observed that a limited degree of tolerance to this limit could be acceptable in the initial phase of download. This comes from the observation of browser behavior. In fact, since a page may require several supplementary files to be rendered (i.e. CSS directives, images, scripts, etc.), in order to promptly display it, browsers have to download all these files in parallel. As a result, mimicking the browser download pattern can speed up crawling (at least of small websites) without influencing the web server performance.

Despite the above considerations, our URL ordering policy still requires a large per-host degree of parallelism to maximize network throughput. In order to avoid breaking the etiquette rules because of the use of virtual hosts, however, some care is required in the selection of the hosts that are downloaded in parallel. In fact, even low rates of requests can become problematic if they have to be served by the same physical machine. We thus modified the ordering policy described in [12], selecting the host with highest priority among those that do not resolve to an IP already in download.
2.2.2. URLs Management

According to the breadth-first strategy, once a URL is found, a crawler has to check whether it is already in the list of known URLs and, if not, append it to the list. Since this check is URL-oriented, it has to be repeated for each outgoing link, both internal and external, in a webpage, resulting in a computationally demanding operation. In particular, distributed crawling architectures split the URL list among nodes, thus checking for a URL and updating the list might also imply extra network traffic.

Our crawling strategy introduces a drastic simplification in URL management. Since a host is downloaded all at once, we do not need to keep track of each individual URL in the crawl: per-host information suffices. In fact, a newly discovered external URL will be retrieved again once the corresponding website is downloaded, unless the crawler reached the maximum number of pages for that host or the corresponding page was not connected to the home. However, except for the case of a disconnected page belonging to a small enough website (which we expect to be of marginal utility for BI applications), keeping the newly discovered URL does not change its crawling fate, thus it can be discarded.

We arrange URLs in two data structures: the intra-links table and the host queue. An intra-links table (see Fig. 1) is locally maintained by each individual crawling agent and destroyed once the download of the corresponding website is completed. This data structure is used to maintain the set of internal pages (sorted by discovery time). We implemented intra-links tables as fast hash sorted sets.
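A minimal sketch of such a table, exploiting the fact that Python dictionaries combine hashed membership tests with insertion order (names are illustrative):

```python
class IntraLinksTable:
    """Per-agent table of a website's internal pages, sorted by discovery time.

    A sketch of a "hash sorted set": membership is O(1) via hashing, while
    iteration follows insertion (i.e. discovery) order, which Python dicts
    guarantee. The table lives only while its website is being downloaded.
    """
    def __init__(self):
        self._pages = {}          # url -> None; dict keys act as the set
        self._next = 0            # cursor over pages still to download

    def add(self, url):
        """Insert `url` unless it is already known; return True if added."""
        if url in self._pages:
            return False          # duplicate: not added because already known
        self._pages[url] = None
        return True

    def next_url(self):
        """Return the next page to download, in discovery order."""
        urls = list(self._pages)
        if self._next >= len(urls):
            return None
        url = urls[self._next]
        self._next += 1
        return url
```

This mirrors the behavior depicted in Fig. 1: a re-discovered URL is rejected, while fresh URLs queue up behind the download cursor.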
Fig. 1. Intra-link table example.
The host queue is a priority queue storing the host URL and IP address (which is looked up only when the host is selected for downloading), the status (complete, in download, pending), and the priority, expressed as the number of pending downloads. In the absence of the complete list of already discovered URLs, however, duplicates (namely URLs discovered more than once) cannot be detected unless accepting a degree of error, thus the number of pending downloads can only be estimated. We propose two simple estimation methods. The easier approach leverages the idea that the importance of a website depends on the number of in-links, thus the number of pending downloads can be replaced with the host in-degree. As an alternative, a small Bloom filter can be associated with each host. Bloom filters are fast and compact data structures to sketch object sets that enable probabilistic membership queries. In our case, once an external link is extracted, the Bloom filter associated with the corresponding host is queried for membership. If the URL is not found, it is added to the Bloom filter and the counter of pending downloads is incremented. Despite being computationally more demanding than host in-degree, Bloom filters are still much lighter than URL lists. Moreover, accessing the host queue happens only for external links, which account for an average of 11.62% of the outgoing links in our experiments.
2.3. Crawling Depth

As mentioned in Section 2.2, in a real scenario storage and bandwidth are limited and expensive, as opposed to the possibility of dynamically creating new web pages, which is unlimited and free. The authors of [14] conclude that crawling up to five levels from the home page is enough to include most of the web that is actually visited. Hypotheses on the power-law distribution of the number of pages per website [16], however, suggest that most websites are smaller. Based only upon common sense, here we try to speculate on how small the threshold on the crawling depth can be for BI applications with severe resource limits.

Setting the depth to the minimum possible, a crawler would only download the website document root. This page, however, is often used as a sort of gateway to collect information on the user configuration and redirect the browser to the appropriate version of the home page. In this case, a crawler should be aware of redirections and follow them until reaching the proper home page.

As expected, home pages are very informative, containing most of the base information about the website. In particular, leveraging only home pages it is possible to derive some structural information about the underlying technologies [17], accomplish large scale usability tests [18], [19] and perform web classification [20].

Extending the crawling to two levels has a large impact in terms of required storage and bandwidth. In fact, home pages often tend to act as a sort of hub, thus the expected volume of pages directly reachable from the home is usually high. Based on a sample of 60 among the most popular websites, we estimated the second level of the web hierarchy to have on average 137 pages (notice that this can be an underestimation, since news websites and general-purpose portals tend to have more links than other websites and can easily exceed 500 internal links). However, the higher cost in terms of resources is balanced by the much
richer information content. For example, e-commerce companies may have specified only commodity macro categories in the home page, leaving the subdivision by micro category or brand to the second level. Other important details like contact information and partnerships are often reported in the second level of the hierarchy.

A crawling depth of three is often synonymous with downloading the entire website. In fact, in order to facilitate accessibility as well as indexing by the search engines, several websites use sitemaps. This technique consists in creating a page directly descending from the home page in which all the internal links of the website are enumerated. As a result, sitemaps bound the website hierarchy to three levels.
2.4. Data Storage

Data storage is probably the most complex key factor in a crawling architecture. Nonetheless, even high impact publications give few details about the storage model. For example, in [21] the authors focus on load-balancing, tolerance to failures, and other network aspects; while in [22] the authors discuss the download ordering of the pages, the updates and the content filtering. In both cases, however, storage is only mentioned. A simple classification of the huge amount of data produced during crawling is, instead, provided in [23]. Crawling data can be classified according to two main characteristics: fixed/variable size and output-only/recurrent usage.

Fixed size data is easier to store and access: conventional databases can handle billions of records and serve hundreds of concurrent requests. Variable size data, instead, requires some further consideration about its usage frequency. In fact, for frequently accessed data a fast access method should be preferred, while for output-only data a compact representation would be better. Frequently, however, fixed and variable size data co-exist in the same record. This is the case, for example, of the record storing the information about a single web page. In fact, according to the level of detail that the crawler provides, this record can contain several fixed size fields, as for example a timestamp, and some variable size fields like the page URL. However, also in this case conventional databases provide well-qualified solutions.

The most verbose variable size crawling data is the HTML content of webpages. This information, however, is used only once to extract outgoing links from the page and then it is stored for output. Small crawlers such as wget store each page (or website) in a dedicated file, leaving to the file system the burden of managing them. Page updates are supported by deleting the corresponding file and replacing it with a new one. Obviously, this solution cannot scale even for small crawls of a few million pages (or websites).

Another simple storing strategy is that of defining a certain number of memory classes (defining buckets of fixed size) and memorizing each page in the bucket that better fits the page size. Each memory class can be stored into a separate file. This solution introduces a certain waste of space that can be minimized with a careful choice of the sizes of the memory classes. A simple method to set memory class sizes is that of downloading a small sample of pages and estimating suitable values from them. Updates are supported also in this case. However, the possibility for the new copy of a webpage to change size (thus requiring a bucket of a different memory class) can introduce a form of fragmentation in which certain buckets become empty. In order to avoid this phenomenon, each memory class can be endowed with a priority queue (i.e. using a min-heap) to take note of the empty slots.
notrequired,HTMLpagescanbestoredinlogfiles.Onceanewpageisdownloaded,itisstoredjustafterthepreviousoneinthelogfile.Keepingarecordcontainingthelengthandareferencetotheoffsetinthelog file, accessing a webpage is very fast since it only requires a disk seek and a single read operation.Storagespaceisminimizedbecauseonlytwointegersperpagearerequired.Althoughsupportingupdatesis possible, thismodel is recommended only for static crawling. In fact, updating a pagewould requireappending the new copy to the log file and updating the corresponding record. However, the previous
version of the page would not be cancelled, thus resulting in a waste of space.
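The log-file scheme can be sketched as follows (a simplified single-file version; the index record holds exactly the two integers mentioned above):

```python
class LogFileStore:
    """Append-only log storage sketch: pages are concatenated in one file
    and located via a per-page (offset, length) record.
    """
    def __init__(self, path):
        self.path = path
        self.index = {}                      # url -> (offset, length)
        open(path, "wb").close()             # create/truncate the log file

    def append(self, url, html):
        with open(self.path, "ab") as log:   # append mode: writes go to the end
            offset = log.tell()
            log.write(html)
        self.index[url] = (offset, len(html))

    def read(self, url):
        """One seek plus one read, as in the access pattern described above."""
        offset, length = self.index[url]
        with open(self.path, "rb") as log:
            log.seek(offset)
            return log.read(length)
```

In a production crawler the index itself would be a fixed-size table persisted to disk, which is exactly the kind of record a conventional database handles well.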
Fig. 2. Running time (in seconds) for the operations of download, parsing and retrieving from disk of 60 among the most popular websites.

We investigated a further class of data, not discussed in the literature, that we call auxiliary data. We define as auxiliary some piece of information that crawlers have to compute for their activity but discard immediately after usage. Sometimes, however, the same information is necessary for the subsequent analysis of the crawled document corpus; thus, storing it during crawling would speed up subsequent analyses, reducing the CPU requirements at the cost of a higher storage consumption. The most important information belonging to this category is parsing. As discussed later in Section 2.4.1, parsing is necessary in all data analyses and can dominate the computational time. Saving it at the cost of a moderate increase of storage usage could be useful in most cases.
2.4.1. Parsing

As mentioned before, crawlers perform parsing to extract outgoing links from web pages. However, since this is considered a mere internal operation, general-purpose crawlers do not return parsing output, thus forcing subsequent analysis software to redo it.

We performed a simple experiment to show how big the impact of parsing is both for crawling and data analysis. We tested the home page of 60 popular websites and computed the speed of downloading, parsing and then retrieving them from disk. We repeated this operation twice so as to measure downloading speed before and after network caching. As expected, apart from a few exceptions (see Fig. 2), parsing is in general much faster than downloading and takes respectively 24.47% of the overall processing time for the first download and 32.64% for the second. Comparing parsing with disk access (as it would be during the analysis), instead, the situation drastically changes. In fact, in this case parsing takes on average 99% of the overall wall-clock processing time. Avoiding repeating parsing during analyses would thus enable a consistent speedup.
tagsinavectorofquadruples(seeFig.3).SincethetreestructureofHTMLisrarelynecessaryforanalysis
we do not keep track of it. However, we notice that building it would cost where is the
numberoftagsinthepage(inourexperiment ).Eachtagisdescribedasitsstartingpositionandlength(namely:thedistance inbytesbetweentheopenedandtheclosedanglebrackets). Inaddition, theoffset of the tag nameand its length is stored. In order tominimize the number of disk read operationsneededto loadthewholeparsingdatastructure,wemaintain inthe firstwordof thevectorthesizeofaquadrupleandthenumberoftags.Asaresult,loadingthedatastructurerequiresonlytwodiskaccesses.
Fig. 3. Graphical description of the parsing data structure.
A relevant problem of this approach (and of parsing in general) is that we cannot make any assumption on the dimensions of pages and tags. In general, except for the length of the tag name, which is limited to 10 bytes in HTML5, we should use up to 4 bytes to store each element of the quadruple. Using relative distances does not mitigate the problem. This would mean that storing the information of each tag would require 13 bytes. Given that our experiments indicate the median of the length of HTML tags to be 8 bytes and 56.27% of the tags to be smaller than 13 bytes, this would be an unacceptable blowup of required space. To deal with this problem, we use an adaptive strategy in which, after processing a page, the parser computes the minimum number of bits necessary for each field of the quadruple (except for the tag name length, which is kept constant at 4 bits). These values are stored in the header word of the parsing structure. In particular, we use 5 bits for the page offset field width (since 2^5 = 32, this limits the maximum page size to 2^32 ≅ 4G bytes), 4 bits for the tag length field width (i.e. limiting the tag length to 2^16 = 65536 bytes), and 4 bits for the distance from the beginning of the tag. In order to fit the header in a 32-bit word, we reserve the remaining 19 bits for the counter of the number of tags. This, however, is not a limitation in practice, since it would mean that our data structure can deal with pages of up to 524288 tags.

We experimentally evaluated the worst-case number of bits required to store each field. We observed that 20 bits are enough for the maximum page length, 11 for the tag length and 7 for the tag name distance. Overall, a quadruple would require a total of 42 bits in the worst case.
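As an illustration, the 32-bit header word can be packed and unpacked with a few shifts and masks. The field layout (5, 4, 4 and 19 bits) follows the description above, while the function names are ours:

```python
def pack_header(off_bits, len_bits, dist_bits, n_tags):
    """Pack the parsing-structure header into one 32-bit word:
    5 bits for the page-offset field width, 4 for the tag-length width,
    4 for the tag-name-distance width, 19 for the number of tags.
    """
    assert off_bits < 2**5 and len_bits < 2**4
    assert dist_bits < 2**4 and n_tags < 2**19
    return (off_bits << 27) | (len_bits << 23) | (dist_bits << 19) | n_tags

def unpack_header(word):
    """Recover the three field widths and the tag counter."""
    return (word >> 27 & 0x1F,    # page offset width (5 bits)
            word >> 23 & 0xF,     # tag length width (4 bits)
            word >> 19 & 0xF,     # tag name distance width (4 bits)
            word & 0x7FFFF)       # number of tags (19 bits)
```

With the worst-case widths reported above (20, 11 and 7 bits), each quadruple needs 20 + 11 + 7 + 4 = 42 bits, matching the figure in the text.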
3. Filtering

What makes BI practitioners skeptical about using the Web as a source of information for their analyses is the untrustworthiness of sources and possibly the low text quality due to the presence of every sort of artificially added content in the data [24] (aka boilerplate). Artificial text is often used to try to circumvent the ranking algorithms of search engines so as to obtain a privileged position in the list of results. Other sources of unrelated content are: interfaces (menus, navigation bars, etc.), advertising banners and parked domains.

In order to discern whether it is possible to use web data for BI or not, we thus have to answer some questions:
1) Are web-based corpora qualitatively comparable with traditional ones?
2) Can web-based corpora influence the BI activity?

In [25], the author tries to answer the first question by carrying out a comparison of "traditional" corpora and web-based ones, focusing on Czech and Slovak. Experiments show that none of the compared corpora is morphosyntactically superior. Web-based corpora are simply different. A further insight comes from [26],
where the author shows the effect of the size of the corpus. The most interesting result is the empirical demonstration that when the dataset is large enough, the language of the web perfectly reflects the standard written language. Lowering the size of the dataset, the equality persists for the most frequent nouns, while differences begin to appear in the least frequent nouns.

A partial answer to the second question is provided in [27], where the authors prove that a computer-mediated text can deceive the reader. However, this category of text has specific linguistic features that can be easily detected and removed [28].

We observe that most of the uninteresting text in web-based corpora comes from two main sources: templates (which include advertisement banners, navigational elements, etc.) and spam websites (aka parked domains). In [29] we proposed a general-purpose algorithm for template removal. A brief discussion of the benefits of our approach for BI, as well as a sketch of the algorithm, is presented in Section 3.1. Differently from approaches like that in [30], our approach is not technology-specific and, thus, it is not supposed to become deprecated with the evolution of web technologies. Moreover, integrating our approach with the crawling ordering described in Section 2.2, it is possible to reduce the storage requirements of crawling while still enabling the restoration of the original web pages.

In [31] we proposed a clustering-based approach to detect the most common category of spam: that of DNS-based parked domains. Our experiments showed that these websites resolve to dedicated name servers. As a result, filtering at the DNS level would introduce a marginal loss of useful information while being much more efficient than per-website classification. However, narrowing to only this category of spam is not enough for BI applications. In Section 3.2.2 we extend the approach in [31] to all the categories of spam websites.
3.1. Template Removal

Since the advent of content management systems (CMS) in the late nineties, web sites started using templates. Estimations in [32] rate 48% of the top websites as using CMSs, while the number of sites that employ templates is probably much higher. Web templates consist in stretches of invariable HTML code interlaced with the main text of the page. The goal of this technology is that of providing a uniform look-and-feel to all the pages of a site, relieving the website manager from the heavy task of manually curating it. Thus, templates do not introduce useful information from the BI point of view, but have a mere role of improving the user experience. However, the impact of templates on the amount of resources used to store websites is not negligible. According to [29], the wasted space due to templates is highly variable and can span from 22% (estimated on en.wikipedia.org) to 97% (estimated on zdnet.com).

A further undesired effect of web templates is that they have been reported to negatively affect the accuracy of data mining tools because of the noise they introduce in the text [33]. In fact, large templates can introduce a non-marginal number of extra words in the webpage. These words, in turn, can either mislead duplicate detection algorithms or artificially bias the similarity of documents.

Consequently, stripping templates from webpages is expected to produce a drastic saving of storage consumption as well as an increase of the data quality for the subsequent BI process.
3.1.1. An Algorithm for Template Removal

In this section we sketch out the main ideas behind our template extraction algorithm, leaving the details to the original work in [29]. Our algorithm assumes that the template of a website is made up of a set of overrepresented stretches of HTML tokens (namely: a tag or a piece of text between two tags) and consists of two main routines: the aligner and a bottom-up template distillation algorithm. The aligner takes in input two tokenized webpages and uses the result of the Needleman and Wunsch global sequence alignment algorithm [34] to build a consensus page containing the union of the tags of the two pages. Each tag is
endowed with a numerical value representing the frequency of the tag in the alignment. Starting from a set P = (p1, ..., pt) of t pages, the template distillation algorithm (see Fig. 4) uses a bottom-up approach that recursively halves the number of elements in P by aligning pairs of pages. Once P contains only one page, the tags with estimated frequency higher than 0.5 (i.e. the tags that are present in at least half of the original pages) are returned as the website template. For ease of computation, t is constrained to be a power of 2.
Fig. 4. Example run of the template distillation algorithm.
3.2. Spam Websites Removal

As mentioned in Section 3, spamming is one of the two major sources of noise in a web-based corpus of documents for BI applications. Although spam websites no longer look like long flat lists of ads, every Internet user has gathered experience on how to recognize them at a glance. In fact, most spam websites tend to look similar to each other. Exploiting this fact, in [31] we developed a clustering-based approach to identify them.

Going further, we observed that it is possible to leverage the mechanism used to create spam websites on a large scale to predict whether a website will contain spam or not without downloading it. In fact, big providers of parking facilities only require an ad-hoc setting of the website name server (NS) to supply the service. As shown in Section 3.2.3, classifying domains according to their NSs leads to an accurate and computationally cheap method to remove spam, saving bandwidth and storage.
3.2.1. A clustering algorithm for spam detection
In this section we sketch our approach, leaving the details to the original work in [31]. According to the intuition that spam websites tend to look similar (not equal) to each other, given a model page and a notion of distance among pages, all spam websites should form a single homogeneous cluster.
However, the reality is a bit more complicated. In fact, parking service companies provide several templates that the customer can use and personalize. Consequently, spam websites do not partition into a single cluster, but form a large number (as large as the number of templates and customizations) of homogeneous and well-separated clusters. Distilling and keeping updated a comprehensive list of templates is unfeasible, thus a model-based approach is not pursuable. Yet, to save resources a website should be classified before downloading it, thus the classification mechanism should be based on features that can be
known in advance.
Based upon the above considerations, our algorithm takes as input a set of home pages resolved from the same name server and classifies it as a regular ISP or as a DPP (Domain Parking Provider). The list of DPPs can subsequently be used during crawling to filter out domains resolving to these NSs, since they are likely to be spam.
As a clustering algorithm we used our own modified version of the furthest-point-first (FPF) algorithm for the k-center problem (namely: find a subset of k input elements to use as cluster centers so as to minimize the maximum cluster radius), originally described in [36]. Our implementation returns the same result as the original version; however, it drastically reduces the number of distance computations by leveraging the triangular inequality in metric spaces. Thus, to use our algorithm it is necessary that the distance function defines a metric space.
Let P be an input set of pages and let C = {c1, ..., ck} ⊂ P be the (initially empty) subset of cluster centers. The first center is chosen at random among the elements in P. Then, as a new center, the element p ∈ P maximizing the distance min_j d(p, c_j) is selected. The procedure stops when the desired number of centers is selected, then the remaining elements of P are assigned to the closest center.
In its simplest form, the distance function d() should measure the proportion of tags in common between two pages given their overall length. Keeping track of the relative ordering of the tags in the pages requires using the quadratic algorithm described in [37]. However, d() can be estimated with a computationally less expensive function f() that (ignoring the relative ordering of the tags) computes the Jaccard score of the frequency vectors (a vector that for each tag counts its number of occurrences) of the pages. Since it holds that f(x, y) ≤ d(x, y), if the value of f() is high enough to decide that a certain page is too far from a center, the computation of d() becomes unnecessary.
3.2.2. Detecting non-NS-based spam websites
The method described in [31] focuses only on spam websites parked through DNS-level redirection. Narrowing to crawling purposes only, this choice is agreeable because classification can be done before the download, thus preventing wasted space and bandwidth. For BI applications, however, all types of redirection must be taken into account. Besides DNS-based, there are two other possible redirection types (see [35] for details): at HTTP level and at HTML level.
HTTP redirections are directly managed by web servers that return a 3XX-class status code and a target URL, while HTML redirections are small pieces of code aimed at causing the browser not to generate output but to immediately load another web page following the appropriate URL (i.e. using the meta tag with refresh set at zero time, calling a JavaScript function updating either the href or the location variable, or introducing a full-window iframe tag). Independently from the type, both redirection kinds cause the browser to land on a target URL belonging to a DPP website [38].
Spam websites using non-DNS redirections can still be detected and filtered out through some
considerations on the distribution of in-links per host. According to the famous model of websites in [39], there exist two important categories of websites: hubs (websites with several out-links) and authorities (websites with several in-links). Let C(γ) be the set of pages belonging to an authority website γ. We can suppose that γ is a spam provider if it satisfies the following conditions:
1) Most of the in-links of γ are due to redirections;
2) The pages in C(γ) cluster evenly.
Condition 1 prevents considering as spam those pages that link to popular services, while condition 2
is the same hypothesis at the base of our approach.
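The cluster-center selection of Section 3.2.1 can be sketched as follows. This is a didactic reconstruction rather than the implementation of [31] and [36]: the first center is fixed instead of random, and the pruning shown accepts any cheap lower bound (such as the f() of Section 3.2.1) rather than the full triangular-inequality machinery.

```python
def fpf(points, k, d, lower_bound=None):
    """Furthest-point-first sketch for the k-center problem.
    d() must be a metric; lower_bound(), when supplied, is a cheaper
    function with lower_bound(x, y) <= d(x, y), used to skip exact
    distance computations that cannot change the result."""
    centers = [points[0]]  # the paper picks the first center at random
    # dist[i] = distance of points[i] from its closest chosen center
    dist = [d(p, centers[0]) for p in points]
    while len(centers) < k:
        # the next center is the point furthest from the current centers
        c = points[max(range(len(points)), key=dist.__getitem__)]
        centers.append(c)
        for i, p in enumerate(points):
            # if even the lower bound is >= the current closest-center
            # distance, d(p, c) cannot improve it: skip the exact call
            if lower_bound is not None and lower_bound(p, c) >= dist[i]:
                continue
            dist[i] = min(dist[i], d(p, c))
    # finally, assign every point to its closest center
    return centers, [min(centers, key=lambda cc: d(p, cc)) for p in points]
```

With pages replaced by points on a line and d() the absolute difference, the sketch picks the two extreme points as centers; the same result is obtained with or without the lower bound, only with fewer exact distance calls.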
Fig. 5. Number of hosts per Name Server (NS). The blue bar indicates regular websites, the red bar indicates parked domains. The red boxes on the x-axis represent DPP NSs, the black box a misclassified NS.
3.2.3. Evaluation
We performed an experiment to quantify the fraction of relevant pages that would be lost using our approach and the volume of irrelevant pages that would be wrongly left in the corpus. We analyzed a crawl of about 2 million valid web pages hosted by more than 100 thousand different primary name servers (NS) and distributed among the four TLDs ".it", ".com", ".org" and ".net". We then focused on the most represented NSs, which are, indeed, those that more likely could host spam websites, removing those with less than 1000 pages. We obtained a corpus of 541,138 valid home pages distributed among 38 name servers. Subsequently, we used the computer-aided approach described in [31] to classify the home pages into two categories: irrelevant/relevant. As Fig. 5 confirms, most NSs (we obfuscated the full names for privacy reasons) focus only on one business, either regular hosting or domain parking, with only four notable exceptions: NS10, NS11, NS19 and NS36. NS10 and NS19 are two medium-size private NSs belonging to two Italian TV production companies that registered a large number (respectively 87.15% and 89.19%) of dictionary-based domain names for their own future use. NS11 and NS36, instead, are companies specialized in domain parking that also host a fraction (6.23% and 11.43%) of their own regular pages. Notice that NS11 is the only NS misclassified using our method.
Table 1. Per-Website Confusion Matrix

                              Human Classification
                           Relevant    Irrelevant      Total
Algorithm   Relevant        430,676        25,464    456,140
            Irrelevant          253        84,745     84,998
Total                       430,929       110,209    541,138
In Table 1 we show in detail the benefits of using our method to filter out spam websites. Out of the 25,464 irrelevant pages misclassified, 7,408 come from NS10 and 4,897 come from NS19. Although these pages are indeed not interesting for the human judge, they cannot properly be considered as spam, since they are not simple lists of ads but still contain human-generated text. Only 7,118 pages can directly be ascribed to software misclassification.
The fraction of relevant content that would be lost is negligible and consists of only 253 pages. This result is particularly important for the class of BI applications in which the discovery of not yet emerging new phenomena is crucial.
As a further important observation, since an NS-based classification does not require downloading and storing web pages, in the case of our experiment the elements of the NSs classified as DPP would not be downloaded, thus saving 15.7% of crawling resources.
Summarizing: experiments show that using the NS to classify whether the corresponding websites are regular or spam provides a reasonable accuracy (as high as 95.25%) and is computationally convenient.
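The headline accuracy follows directly from Table 1; as a sanity check:

```python
# Entries of the per-website confusion matrix (Table 1)
rel_rel, rel_irr = 430_676, 25_464  # classified relevant: truly relevant / irrelevant
irr_rel, irr_irr = 253, 84_745      # classified irrelevant: truly relevant / irrelevant

total = rel_rel + rel_irr + irr_rel + irr_irr  # 541,138 home pages
accuracy = (rel_rel + irr_irr) / total         # correct decisions over all pages
print(f"accuracy = {accuracy:.2%}")            # accuracy = 95.25%
```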
4. Classification
Letting a general-purpose crawler loose, it will start downloading the web without discriminating among topics until stopped or until its storage/bandwidth resources are completely exhausted. The resulting corpus will consist of a huge mass of documents spanning almost all human knowledge. However, since typical BI applications need to focus only on documents about the investigated topic, a mechanism to filter out unnecessary websites is necessary.
Filtering can be applied either at crawling time (focused crawling) or ex post. In the first case, once a website is discovered, the crawler classifies it and decides whether to download it or not. The most important advantage of this strategy is that of saving resources. In fact, if a website is classified as off-topic, it is neither crawled nor stored. On the other hand, discarding a website could cause links to interesting pages to be irreparably lost. A posteriori filtering is much more resource demanding than focused crawling and should be preferred only in those situations in which the application requires crawling a sample as complete as possible. Independently from the approach, however, a mechanism to classify websites is necessary.
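The focused alternative can be sketched as a minimal, library-free loop; fetch, extract_links and on_topic are hypothetical stand-ins for the real crawler components.

```python
from collections import deque

def focused_crawl(seeds, fetch, extract_links, on_topic, limit=1000):
    """Minimal focused-crawling loop: a discovered site is classified
    first and fetched only if on topic, so off-topic sites cost no
    bandwidth or storage."""
    frontier, seen, corpus = deque(seeds), set(seeds), {}
    while frontier and len(corpus) < limit:
        url = frontier.popleft()
        if not on_topic(url):       # classify before downloading
            continue
        page = fetch(url)           # download only on-topic sites
        corpus[url] = page
        for link in extract_links(page):
            if link not in seen:    # enqueue newly discovered sites
                seen.add(link)
                frontier.append(link)
    return corpus
```

Running it on a stub link graph also illustrates the drawback noted above: a page reachable only through a discarded off-topic site is never discovered.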
4.1. Creation of the Training Set
Supervised classification is the task of assigning a document to a class based on pre-existing examples. This tool has successfully been used in focused crawling as well as in general website classification. The hunger of classifiers for training examples, however, poses severe limitations to the practical usage of supervised classification. In fact, building a large-enough, high-quality training set can require a long and tedious manual classification by a domain expert. In order to reduce human involvement in the labeling of examples, some authors proposed approaches that leverage existing knowledge bases. In [40] and [41] the authors use ontologies to drive crawling. In particular, in [40] the authors show how to exploit the newly discovered documents to extend the ontology, obtaining an increasingly more accurate training set; while in [41] the authors describe a framework to compute the relevance of a document with respect to an ontology entity. In [42] the authors experimentally prove that website classification can be done without sacrificing accuracy using only positive examples. All those works, however, share the assumption that such a form of domain knowledge exists and is available.
The above assumption is acceptable when dealing with general topics that are more likely to be covered by such a knowledge base. Small and medium enterprises, instead, may be interested in uncommon specialized topics not represented well enough in the existing freely available datasets. Using clustering in place of a supervised approach would mean trading the classification accuracy for the simplicity and ubiquity of the methodology.
In [20] we faced the problem of using a small description of a class (as small as a weighted list of keywords) to automatically derive from the Web a list of pages that are more likely to be good examples to train a classifier. As most semi-supervised methods, our approach attempts to balance the use of human resources with the decrease of classification accuracy. In fact, the easy task of compiling a list of keywords comes at the cost of introducing some misclassified examples and, in turn, a slight reduction of accuracy.
Fig. 6. Graphical representation of the labeling pipeline.
4.1.1. A semi-supervised pipeline for the extraction of training examples
In this section we sketch out our pipeline, leaving the details to the original work in [20]. Both the plethora of informal guides on how to choose a domain name and the scientific literature on the psychological effects of brand names agree that a successful brand should consist of the juxtaposition of a few words. In particular, in [43] the authors perform an experiment in which they discuss the effects of using meaningful brand names. Results show how names conveying the benefits of the product (e.g. smartphone) achieve a commercial advantage over other brands. Going a step further, in [44] the authors analyze the linguistic characteristics of brand names, concluding that semantic appositeness is one of the three main features that contribute to making the brand familiar and, in turn, successful. Although popular brands of the net economy coined in the mid-nineties represent eminent exceptions to the above studies, it is realistic to hypothesize the abundance of domain names consisting of meaningful short sentences semantically related to the website content.
Given a set C = {c1, ..., cn} of n classes, each of which is described as a document, and a URL tokenized as a list of keywords, our approach maps the problem of assigning a website to a class to the problem of ranking classes according to their relevance to the URL text. Figure 6 graphically depicts the sequence of steps of our pipeline.
We used a dictionary-based greedy algorithm to distill a list of words from a URL. The domain name is scanned from left to right until a dictionary word is found, then the word is extracted and the procedure is recursively invoked on the remaining part of the URL. In order to avoid misleading or trivial tokenizations (e.g. "an acid" instead of "anacid"), our algorithm searches words in length-decreasing order. If the procedure does not succeed in using all the characters of the domain, it backtracks to test different alternatives. If no complete tokenization is found, the URL is discarded and the corresponding website is no longer considered as a potential candidate to be added to the training set.
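The greedy longest-first scan with backtracking can be sketched as follows. This is a reconstruction: the real pipeline of [20] works on full URLs and a large dictionary, while here the dictionary is a plain set of sample words.

```python
def tokenize_domain(name, dictionary):
    """Greedily split `name` into dictionary words, trying prefixes in
    length-decreasing order and backtracking on failure. Returns the
    word list, or None if no complete tokenization exists (in which
    case the URL would be discarded)."""
    if not name:
        return []
    for end in range(len(name), 0, -1):   # longest prefix first
        prefix = name[:end]
        if prefix in dictionary:
            rest = tokenize_domain(name[end:], dictionary)
            if rest is not None:          # the suffix tokenized too
                return [prefix] + rest
    return None                           # backtracking exhausted
```

Searching prefixes longest-first is what makes "anacid" tokenize as the single word "anacid" rather than the misleading "an" + "acid".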
solve theproblemofsortinga setofdocumentsaccording toapoolofmeasuresof relevancetoaquery.Eachrelevancemeasureinducesarankingthatisaveragedwiththeothersinorderproduceafinalranking.Inourcase,weusedtheclassdescriptionsinplaceofthesetofdocumentsandthetextextractedfromtheURLas the query. Relecance is computed through: cosine similarity, generalized Jaccard coefficient [46],dicecoefficient,overlapcoefficient,WeightedCoordinationMatchSimilarity[47],BM25[48],andLMIRDIR[49].Aslabelwechosetheclasswithhighestranking,whileasadegreeofmembershipweusedtheaverageoftherelevancemeasuresnormalizedtotherange[0,1]. Inordertominimizethenumberofmislabelledexamples,andinturnbiasthesubsequentlearning,the
labellingpipeline implementstwofilters:after tokenizationandafterranking. Thefirst filterconsists inmeasuringthedegreeofpage/URLcoherence:ouralgorithmcomputesthecosinesimilaritybetweenthetextderivedfromtheURLandthepagecontent, then itverifiesthat thismeasureexceedsauser-defined
threshold. The second filter ensures that the class assignment is neither weak nor ambiguous. In particular, weakness is measured as the inverse degree of membership with the labelling class, while ambiguity is evaluated as the number of classes for which the degree of membership exceeds a pre-defined threshold.
Experiments in [20] over 20,016 web pages belonging to 9 non-overlapping categories show that 32.36% of the URLs are successfully tokenized and 26.33% of the pages are labeled with an accuracy of 85.8%.
5. Conclusion
In this paper we addressed the problem of enabling BI and big data analytics also for those small and medium enterprises that, because of their limited budget and IT resources, until now have been excluded from the benefits of these activities. In particular, we observed that, despite being morphologically different from standard text corpora, using the Web as a source of information does not necessarily lead to low-quality data. In fact, by applying convenient template removal filters as well as spam filters, it is possible to remove undesired text/pages without altering the useful informative content of the Web. We also showed how careful design choices can contribute to drastically reduce the cost of crawling and, in turn, allow downloading a large-enough portion of the Web so as to make BI statistically significant. Finally, we dealt with the problem of per-topic website classification, showing that a good tradeoff between the human effort of creating a manual description of the classes and the classification accuracy is possible. Summarizing: our experience shows that accurate resource-driven design choices can lead to a satisfactory BI activity also for small and medium businesses without sacrificing quality.
Acknowledgment
This work was supported by: the Regione Toscana of Italy under the grant POR CRO 2007/2013 Asse IV Capitale Umano, and the Italian Registry of the ccTLD .it "Registro.it".
References
[1] Luhn, H. P. (1958). A business intelligence system. IBM J. Res. Dev., 314-319.
[2] Chen, H., Chiang, R. H. L., & Storey, V. C. (2012). Business intelligence and analytics: From big data to big impact. MIS Quarterly, 36.
[3] Hu, H., Wen, Y., Chua, T., & Li, X. (2014). Toward scalable systems for big data analytics: A technology tutorial. IEEE Access, 2, 652-687.
[4] Kumar, M., Bhatia, R., & Rattan, D. (2017). A survey of Web crawlers for information retrieval. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery.
[5] Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A., & Wiener, J. (2000). Graph structure in the web. Computer Networks, 33(1), 309-320.
[6] Rowlands, T., Hawking, D., & Sankaranarayana, R. (2010). New-web search with microblog annotations. Proceedings of the 19th International Conference on World Wide Web.
[7] Aiello, L. M., Petkos, G., Martin, C., Corney, D., Papadopoulos, S., Skraba, R., Oker, A., Kompatsiaris, I., & Jaimes, A. (2013). Sensing trending topics in Twitter. IEEE Trans. on Multimedia.
[8] Wang, D., Navathe, S. B., Liu, L., Irani, D., Tamersoy, A., & Pu, C. (2013). Click traffic analysis of short URL spam on Twitter. Proceedings of the 9th Int. Conf. on Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom).
[9] Cho, J., Garcia-Molina, H., & Page, L. (1998). Efficient crawling through URL ordering. Computer Networks and ISDN Systems.
[10] Najork, M., & Wiener, J. L. (2001). Breadth-first crawling yields high-quality pages. Proceedings of the 10th International Conference on World Wide Web (WWW '01).
[11] Baeza-Yates, R., Castillo, C., Marin, M., & Rodriguez, A. (2005). Crawling a country: Better strategies than breadth-first for web page ordering. Special Interest Tracks and Posters of the 14th Int. Conf. on World Wide Web (WWW '05).
[12] Castillo, C., Marin, M., Rodriguez, A., & Baeza-Yates, R. (2004). Scheduling algorithms for Web crawling.
[13] Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., Leach, P., & Berners-Lee, T. (1999). RFC 2616 - HTTP/1.1, the hypertext transfer protocol. http://w3.org/Protocols/rfc2616/rfc2616.html
[14] Baeza-Yates, R., & Castillo, C. (2004). Crawling the infinite web: Five levels are enough. Algorithms and Models for the Web-Graph.
[15] Iyengar, A. K., Squillante, M. S., & Zhang, L. (1999). Analysis and characterization of large-scale Web server access patterns and performance.
[16] Adamic, L. A., & Huberman, B. A. (2001). The Web's hidden order. Commun. ACM.
[17] Gomes, D., Nogueira, A., Miranda, J., & Costa, M. (2009). Introducing the Portuguese web archive initiative. 8th International Web Archiving Workshop.
[18] Albert, W., & Tullis, T. (2013). Measuring the User Experience: Collecting, Analyzing, and Presenting Usability Metrics. Newnes.
[19] Lopes, R., Gomes, D., & Carriço, L. (2010). Web not for all: A large scale study of web accessibility. Proceedings of the Int. Cross Disciplinary Conference on Web Accessibility.
[20] Geraci, F., & Papini, T. (2017). Approximating multi-class text classification via automatic generation of training examples. Proceedings of the 18th International Conference on Computational Linguistics and Intelligent Text Processing.
[21] Boldi, P., Codenotti, B., Santini, M., & Vigna, S. (2004). UbiCrawler: A scalable fully distributed web crawler. Software: Practice and Experience.
[22] Olston, C., & Najork, M. (2010). Web crawling. Foundations and Trends in Information Retrieval, 4(3).
[23] Felicioli, C., Geraci, F., & Pellegrini, M. (2011). Medium sized crawling made fast and easy through Lumbricus webis. Int. Conf. on Machine Learning and Cybernetics.
[24] Gyongyi, Z., & Garcia-Molina, H. (2005). Web spam taxonomy. 1st Int. Workshop on Adversarial Information Retrieval on the Web (AIRWeb).
[25] Benko, V. (2017). Are web corpora inferior? The case of Czech and Slovak. Proceedings of the Workshop on Challenges in the Management of Large Corpora and Big Data and Natural Language Processing.
[26] Khokhlova, M. (2016). Large corpora and frequency nouns. Proceedings of the Int. Conf. on Computational Linguistics and Intellectual Technologies: "Dialogue 2016".
[27] Zhou, L., Burgoon, J. K., Nunamaker, J. F., & Twitchell, D. (2004). Automating linguistics-based cues for detecting deception in text-based asynchronous computer-mediated communications. Group Decision and Negotiation.
[28] Piskorski, J., Sydow, M., & Weiss, D. (2008). Exploring linguistic features for web spam detection: A preliminary study. Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web (AIRWeb '08).
[29] Geraci, F., & Maggini, M. (2011). A fast method for web template extraction via a multi-sequence alignment approach. International Joint Conference on Knowledge Discovery, Knowledge Engineering, and Knowledge Management.
[30] Schafer, R. (2017). Accurate and efficient general-purpose boilerplate detection for crawled web corpora. Language Resources and Evaluation, 51(3), 873-889.
[31] Geraci, F. (2015). Identification of web spam through clustering of website structures. Proceedings of the 24th International Conference on World Wide Web.
[32] W3Techs. Usage of content management systems for websites. https://w3techs.com/technologies/overview/content_management/all/
[33] Martin, L., & Gottron, T. (2012). Readability and the Web. Future Internet, 4(1).
[34] Needleman, S. B., & Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology.
[35] Almishari, M., & Yang, X. (2010). Ads-portal domains: Identification and measurements. ACM Trans. Web, 4(2).
[36] Gonzalez, T. F. (1985). Clustering to minimize the maximum intercluster distance. Theoretical Computer Science.
[37] Myers, E. W. (1986). An O(ND) difference algorithm and its variations. Algorithmica, 1(1).
[38] Li, Z., Alrwais, S., Xie, Y., Yu, F., & Wang, X. (2013). Finding the linchpins of the dark web: A study on topologically dedicated hosts on malicious web infrastructures. IEEE Symposium on Security and Privacy.
[39] Kleinberg, J. M. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM.
[40] Luong, H. P., Gauch, S., & Wang, Q. (2009). Ontology-based focused crawling. Proceedings of the Int. Conf. on Information, Process, and Knowledge Management.
[41] Ehrig, M., & Maedche, A. (2003). Ontology-focused crawling of Web documents. Proceedings of the 2003 ACM Symposium on Applied Computing (SAC '03).
[42] Yu, H., Han, J., & Chang, K. C. (2004). PEBL: Web page classification without negative examples. IEEE Transactions on Knowledge and Data Engineering, 16(1).
[43] Keller, K. L., Heckler, S. E., & Houston, M. J. (1998). The effects of brand name suggestiveness on advertising recall. The Journal of Marketing.
[44] Lowrey, T. M., Shrum, L. J., & Dubitsky, T. M. (2003). The relation between brand-name linguistic characteristics and brand-name memory. Journal of Advertising.
[45] Hang, L. (2011). A short introduction to learning to rank. IEICE Trans. on Information and Systems.
[46] Charikar, M. S. (2002). Similarity estimation techniques from rounding algorithms. Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing.
[47] Wilkinson, R., Zobel, J., & Sacks-Davis, R. (1995). Similarity measures for short queries.
[48] Robertson, S. E. (1997). Overview of the Okapi projects. Journal of Documentation.
[49] Bennett, G., Scholer, F., & Uitdenbogerd, A. (2008). A comparative study of probabilistic and language models for information retrieval. Proceedings of the 19th Conf. on Australasian Database.
Loredana M. Genovese was born in Crotone, Italy, in June 1971. She received her master's degree in computer science from the University of Pisa in 2002 with a thesis on parallel replicated databases. She was also awarded a "laurea specialistica" degree in informatics in 2004. After several years spent at a private Internet company, she joined the Institute for Informatics and Telematics of the Italian National Research Council in 2013 with the position of research assistant.
Filippo Geraci was born in Sicily, Italy, in September 1977. He received his degree in computer science from the University of Pisa and his Ph.D. in information engineering from the University of Siena. He is currently a researcher at the Institute for Informatics and Telematics of the Italian National Research Council in Pisa. He held the chair of the course of Information Systems for Business Management at the University of Siena. He is a tutor of the course of Algorithm Design at the University of Pisa. His research fields are: algorithms for the web, and bioinformatics.