Deliverable D3.1 Research Report on TDM Landscape in Europe · Research Report on TDM Landscape in...
Transcript of Deliverable D3.1 Research Report on TDM Landscape in Europe · Research Report on TDM Landscape in...
©2016FutureTDM|Horizon2020|GARRI-3-2014–665940
REDUCING BARRIERS AND INCREASING UPTAKE OF TEXT AND DATA MINING FOR RESEARCH
ENVIRONMENTSUSINGACOLLABORATIVEKNOWLEDGEANDOPENINFORMATIONAPPROACH
DeliverableD3.1
ResearchReportonTDMLandscapein
Europe
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
2
Project
Acronym: FutureTDM
Title: Reducing Barriers and Increasing Uptake of Text and Data Mining for Research
EnvironmentsusingaCollaborativeKnowledgeandOpenInformationApproach
Coordinator: SYNYOGmbH
Reference: 665940
Type: Collaborativeproject
Programme: HORIZON2020
Theme: GARRI-3-2014-ScientificInformationintheDigitalAge:TextandDataMining(TDM)
Start: 01September,2015
Duration: 24months
Website: http://www.futuretdm.eu
E-Mail: [email protected]
Consortium: SYNYOGmbH,Research&DevelopmentDepartment,Austria,(SYNYO)
StichtingLIBER,TheNetherlands,(LIBER)
OpenKnowledge,UK,(OK/CM)
RadboudUniversity,CentreforLanguageStudies,TheNetherlands,(RU)
TheBritishLibraryBoard,UK,(BL)
UniversiteitvanAmsterdam,Inst.forInformationLaw,TheNetherlands,(UVA)
Athena Research and Innovation Centre in Information, Communication and
Knowledge Technologies, Inst. for Language and Speech Processing, Greece,
(ARC)
UbiquityPressLimited,UK,(UP)
FundacjaProjekt:Polska,Poland,(FPP)
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
3
Deliverable
Number: D3.1
Title: ResearchreportonTDMLandscapeinEurope
Leadbeneficiary: RU
Workpackage: WP3:ASSESS:Studies,Publications,LegalRegulations,PoliciesandBarriers
Disseminationlevel: Public(PU)
Nature: Report(RE)
Duedate: 31.01.2016
Submissiondate: 29.04.2016
Authors: MariaEskevich,RU
AntalvandenBosch,RU
Contributors: MarcoCaspers,UVA
LucieGuibault,UVA
AlessioBertone,SYNYO
SusanReilly,LIBER
CarmenMunteanu,SYNYO
PeterLeitner,SYNYO
SteliosPiperidis,ARC
Review: MarcoCaspers,UVA
LucieGuibault,UVA
AlessioBertone,SYNYO
Acknowledgement: This project has received
funding from the European Union’s Horizon 2020
Research and Innovation Programme under Grant
AgreementNo665940.
Disclaimer: The content of this publication is the
sole responsibility of the authors, and does not in
any way represent the view of the European
Commissionoritsservices.
ThisreportbyFutureTDMConsortiummemberscan
bereusedundertheCC-BY4.0license
(https://creativecommons.org/licenses/by/4.0/).
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
4
TableofContents
1 Introduction....................................................................................................................................6
1.1.Whatistextanddatamining?.....................................................................................................6
1.2.ThegrowingimportanceofTDM.................................................................................................7
1.3.Goalandstructureofthereport..................................................................................................9
2 TDMframework............................................................................................................................11
2.1.Datatobemined........................................................................................................................11
2.2.TechnologyandR&D..................................................................................................................16
3 TDMacrosseconomicsectors.......................................................................................................21
3.1.Primarysector............................................................................................................................22
3.2.Secondarysector........................................................................................................................22
3.3.Tertiarysector............................................................................................................................23
3.4.Quaternarysector......................................................................................................................25
3.5.Quinarysector............................................................................................................................26
4 Summaryoftheliterature.............................................................................................................27
5 Conclusion.....................................................................................................................................40
References.............................................................................................................................................41
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
5
LISTOFFIGURES
Figure1.NumberofdifferenttypesofpublicationsintheDBLPComputerSciencebibliography........8
Figure2.NumberofdifferenttypesofpublicationsontheEuropeanLifeSciencesPMCwebsite........9
Figure3.MacroviewofTDMframework.............................................................................................11
Figure4.MicroviewonTDMenvironment..........................................................................................17
Figure5.GeneraleconomicstructureandconnectionbetweenTDMandalleconomicsectors........22
Figure6.Timelineofreports.................................................................................................................29
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
6
1.INTRODUCTION
Recentyearshaveseenanexponentialgrowthinthevolumeofdigitaldataavailabletoalllevelsof
society.Afurtherproliferationofdigitalordigitisedcontentisexpectedinthis,theeraofBigData.A
numberofinterrelatedfactorshaveboostedthisdatathrustandtheaccompanyingdevelopmentof
technologiestoexploitthisnewtypeofinformation-richgroundmaterialwithtextanddatamining
(TDM). First, the costs for data storage and cloud computing facilities continue to decrease,while
computing power increases. Second, digitisation has moved from specific processes, such as
archiving and optimizing industrialworkflows, to nearly all daily aspects of societal activities. This
processpavesthewayfordataficationofthisinformation,i.e.thecreationofnewinformationand
knowledge, potentially with new value, on the basis of machine-readable content. TDM is the
umbrellatermfortechnologiesthatenablethisgoal.
Thecontinuousgrowthofdata1increasinglyemphasisesthequestionofhowtobestmakeuseofit.
Societyexpects,withgoodreason,thatthe implementationof intelligentmethodstodealwithBIg
Datawill leadtoimprovedinformationaccess,fasterknowledgediscovery,higherproductivity,and
competitiveness.TheseexpectationshaveinspiredTDMresearchersandcompaniesacrosstheworld
to explore novel algorithms for extracting information, discovering knowledge, and improving the
ability topredict fromdataona large scale.Asdata is amanifold concept, so is TDM.This report
aimstochartandstructuretheTDMlandscape.
1.1.WhatisTextandDataMining?
Text and Data Mining was initially defined as “the discovery by computer of new, previously
unknown information, by automatically extracting and relating information from different (…)
resources,torevealotherwisehiddenmeanings”(Hearst,1999),inotherwords,“anexploratorydata
analysisthatleadstothediscoveryofheretoforeunknowninformation,ortoanswersforquestions
for which the answer is not currently known” (Hearst, 1999). Nowadays, this umbrella term
encompassesdiversetechniquesthatallow interpretationofcontentofanytyperangingfromraw
data, e.g. sensor data, text, images and multimedia, to processed content, e.g. diagrams, charts,
tables,references,maps,formulas,chemicalstructures,andmetadatafromsemi-structuredsources,
onalargescalethroughtheidentificationofpatterns.
TDM algorithms harbour vast potential for nearly all scientific fields and for a wide diversity of
practical industrial and societal applications. However, this broad potential and variety of TDM
applications complicates the landscape view, as even the technology itself is referred to using
different terms within different fields. Within the business and managerial context, TDM can be
referred to asBusiness Intelligence Solutions orQualitative Data Analysis. To highlight the central1Itisexpectedthattherewillbe16trilliongigabytesofdataby2020,whichcorrespondstoanannualgrowthrateof236% indatageneration (Motion fora resolution further toQuestion forOralAnswerB8-0116/2016pursuant to Rule 128(5) of the Rules of Procedure on “Towards a thriving data-driven economy”(2015/2612(RSP),2016):http://www.europarl.europa.eu/sides/getDoc.do?type=MOTION&reference=B8-2016-0308&language=EN
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
7
positionofvastquantitiesofdata,thetermBigData(Processing)isused.Whenthefocusisonthe
discovery of hitherto undiscovered information, Content Mining appears to be a term of choice,
whileDataAnalyticsandTextAnalyticsemphasisethedata-drivenaspectofperforminganalysesand
afocusonseekinganalyticalsolutionstochallenges.Intheacademicworld,TDMisoftenconsidered
tobecomposedofMachineLearningmethodscoupledwithExploratoryDataAnalysismethodssuch
as visualisation.Machine Learning, in turn, arose from thepost-war fields ofArtificial Intelligence,InformationTheory,andPatternRecognition.
1.2.TheGrowingImportanceofTDM
Withtheincreaseofcomputationalpower,theriseofBigData,andthedigitisationofvastamounts
ofdata,theuseofTDMpromisestoyieldvaluablesocietalandeconomicbenefits intermsofnew
insights and cost savings that would otherwise not be possible: more scientific breakthroughs, a
greaterunderstandingof society, and countlessbusinessopportunities. Thepotential of TDMas a
researchtechniqueandasaneconomicopportunityhasbeenrecognisedatapoliticallevelintheEU
(HargreavesI.etal.,2014).2ThebenefitsofTDMhavealsobeennotedbytheresearchcommunity
andsocietalstakeholderse.g.viatheHagueDeclaration3.
Originallyrootedinacademicresearch,TDMhasleftitsacademicchildhoodyearsandisnowaglobal
informationtechnology(IT)industry.Asanindustry,TDMisfuelledbydirectaccesstolargeamounts
ofdata.CompaniescaneitherdevelopandruntheirownTDMsoftware,oroutsourcethisworkto
external experts who can customise TDM solutions. Industry is now rapidly rolling out the
implementationofTDMforBigData,ascomputingpowergrowsandbusinesses findnewways to
gather and access data. According to the International Data Corporation’s Worldwide Big Data
TechnologyandServicesForecastfor2013-2017,thegrowthrateoftheBigDatamarketuntil2017
willbesixtimesfasterthanthatoftheoverallinformationandcommunication(ICT)market.TheBig
DataValuePublic-PrivatePartnership2predictsthatTDMwillreachatotalofEUR50billionby2017,
andmayresult in3.75millionnewjobs.WhilemanycompaniesseethepotentialofTDMfortheir
business development, only 1.7% of companies are currentlymaking full use of advanced digital
technologieslikeTDM,despitethebenefitsthatdigitaltoolscanbring.2
TheapplicationofTDMtoscienceholdsthepromiseofacceleratingscientificdiscoveries.Miningthe
contentpresent in researchpublicationdatabases and theassociated researchdata sets, as far as
they are available for harvesting, will boost scientific research by increasing the productivity of
researchers and leaving part of the scientific discovery process to computers, a trend calledDataScience. The size of scientific publication databases varies across domains, but in each case, the
number is beyond a value thatwould be feasible for a human to read and assesswithout digital
support. Figures 1-2 give examples of the volume of content that has to be dealt with by the
2 (Motion for a resolution further toQuestion forOralAnswerB8-0116/2016pursuant toRule 128(5) of theRules of Procedure on “Towards a thriving data-driven economy” (2015/2612(RSP), 2016):http://www.europarl.europa.eu/sides/getDoc.do?type=MOTION&reference=B8-2016-0308&language=EN;http://ec.europa.eu/research/innovation-union/pdf/TDM-report_from_the_expert_group-042014.pdf3TheHagueDeclarationonKnowledgeDiscovery:http://thehaguedeclaration.com/the-hague-declaration-on-knowledge-discovery-in-the-digital-age/
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
8
academia and industry researchers in the Computer Science4 and Life Sciences domains5,
respectively. TDMmay help researchers in all scientific fields to regain control over the academic
process,toaggregateinformationandtofollowtrendsacrossfieldsbeyondtheirown.Estimationsof
addedeconomicvalueinthiscase,basedonthetimesavingbenefitofTDMalone,areanincreaseof
2percent,orEUR5.3billion,withlong-termimpactofoverallgainintherangeofatleastEUR32.5
billion(Filippov,2014).
Figure1.NumberofdifferenttypesofpublicationsintheDBLPComputerSciencebibliography.6
4http://dblp2.uni-trier.de5https://europepmc.org6http://dblp2.uni-trier.de/statistics/publicationsperyear.html[accessedonMarch2016]
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
9
Figure2.NumberofdifferenttypesofpublicationsontheEuropeanLifeSciencesPMCwebsite.7
1.3.Goalandstructureofthereport
Inthisreport,weoutlinethetextanddatamininglandscape.Westartbyprovidinganoverviewof
types of data and technologies that the TDM experts are working with. We look at the global
economicstructureoftheknowledge-basedsocietyandhighlightusecasesforTDMuptakeacrossall
economic sectors. Following recentdiscussionsonTDMperspectiveswithin the scientific,political,
andeconomiccontext,wehighlighttheimportanceofthetechnologyanditspotential.
TDMalgorithmsaredevelopedglobally. It is thereforeour task to list and structure the important
technology definitions, data types, underlying algorithms and leading research communities on a
worldwide scale. However, the actual uptake of TDM and ease of practical implementation vary
considerablydependingonthecountryandtheeconomicsectorinquestion.Inthisreport,wegivea
broadperspectiveon the general EuropeanUnion (EU) landscape, andquoteexceptional country-
leveldirectivesonlywhenthesecountrieshavelegalregulationsspecificallyaddressingTDM,e.g.the
caseofstatutoryexceptiononcopyrightintheUK8.
TDM is becoming increasingly ubiquitous on the global market, as each company that tracks its
activities,products,andcommunicationsinsomedigitalformhasthepotentialtoreusethisdatafor
furtheranalysisandimprovement.Forsomesectors,countriesandapplications,itisalreadypossible
toestimate thepotential profit or at least the sizeof themarket.Other fields suchas journalism,
7https://europepmc.org/About[accessedonMarch2016]8 "Section 29A and Schedule 2(2)1D of the Copyright, Designs and Patents Act 1988":http://www.legislation.gov.uk/uksi/2014/1372/regulation/3/made
Abstracts30.9million,25.9millionfromPubMed
Patents4.2million
Agricolarecords617244
Fulltextarticles3.6million
NHSguidelines765
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
10
which is in the process of embracing the concept ofData Journalism, have to first agree on the
conceptualchangesrequiredtotheircurrentworkbehaviourinordertomergethetraditionalwork
patternwithTDM(LewisS.C.andWestlundO.,2014).
This report is structured as follows: in Section 2,we introduce the definitions of the overall TDM
framework, classifications of data types, and the technologies that can be used in text and data
miningprocess.InSection3,weshowhowTDMispotentiallyrelatedtoallsectorsofeconomics,and
we illustrate this assumptionwith examples and potential use cases of TDM implementation and
integration across sectors. In Section 4, we draw a timeline of scientific, industrial and political
documents thatemphasise the importanceofand interest inTDMand theneed foruptakeacross
sectors,andwesummarisethekeymessages.WeconcludethereportwithSection5,outliningthe
importanceofTDMforthefutureoftheeconomyandscience.
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
11
2.TDMFRAMEWORK
Thequantityofdigitalcontentcurrentlyproducedbysocietyandthepotentialeaseofaccesstothis
dataatany locationviaeasy-to-usedevices (withreliable Internetconnection,andmobiledevices)
increasetheneedforcontentminingtechnologiesthatboosthumancapacitiesintermsofspeedof
dataprocessing (Mayer-SchoenbergerandCukier,2013).TDMtechnologiescansupport individuals
aswellaslargerinstitutionsandcompaniesindecision-makingprocessesandinthedevelopmentof
novel ideas and solutions. In this sectionwe discuss diverse data classifications, technologies that
underlieTDM implementations,and the researchcommunities thatbring together the researchers
anddevelopersfrombothacademiaandindustrytoimproveandadvanceTDMalgorithms.
2.1.Datatobemined
TDMisnotdefinedthroughasinglescientificdomainortheory; it isacomplexsetoftechnologies
with different underlying theories and assumptions, consisting of components that are developed
withindiversedisciplines.AsFigure3shows,atamacrolevel,theTDMframeworkiscentredaround
data.AsweentertheBigDataera(Mayer-SchoenbergerandCukier,2013),theefficientstorageand
managementofcontentrepresentasignificantchallengeoftheirown,bothintermsoftheamount
ofinformationandintermsoftheheterogeneousnatureofthedata,whichcanbenumeric,textual,
multimodal,etc.Themorecomplexthedata,themorecreativeandinnovativetheTDMcomponents
have to be. Data can be structured according to its size and legal accessibility, according to its
sources,orbywhoproducesitthecontentthatconstitutesthedatacollections.
InSection2.1.1,weprovideanoverviewofthetypesofsourcesthatcangenerateandholddata,to
illustratethebreadthofthesourcesonwhichTDMmayoperate. InSection2.1.2,wesummarisea
newlyproposedclassificationofdatasetsaccordingtodatasizeandtypesoflegalaccess,inorderto
identifythetypesofdatasetsthatcouldbemoreeasilyaccessedandminedbyfor-profitandnon-
profitorganisations,inthecasethatcopyrightrestrictionsareadjusted.
Figure3.MacroviewofTDMframework.9
9ThisFigure(‘MacroviewonTDMenvironment’)byFutureTDMcanbereusedundertheCC-BY4.0license.
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
12
2.1.1.Datasourcesanddataholders
Datacomesincountlessforms.It iseasytothinkofdataasintentionallycollectedtextorscientific
data,butdatamaybealsoproducedbyhumansandbusinessesastheycarryoutdatafiedactivities,
leavingso-called“dataexhaust”asaby-productof theiractionsandmovements in theworld,e.g.
users’ online interactions, corporate and personal websites, and released digital documents and
archives (Mayer-Schoenberger and Cukier, 2013). Another type of data is represented by
measurements of uncontrolled events happening in nature as recorded by weather sensors,
telescopes,andotherinstrumentsacrosstheglobe.
Belowwelistdifferentdatasources,rangingfrompersonaldata,createdortrackedfromindividuals,
to professionally gathered and created content in formsof cultural knowledgedata, scientific and
businessdata,publicsectorinformation.AllthisdatacanbeeithersharedpubliclyontheInternet,
andstoredoffline.Whensharedpublicly,theoriginalcontentisenrichedwithadatadescriptionand
furtherinformationthatareavailableonawebsite.
Personaldata
ThetracesthateachInternetuserleavesformamassiveamountofinformationthathasvaluewhen
gathered and structured on amass scale. Currently this information ismostly given by individuals
eitherwithorwithouttheirconsent;inthefirstcaseoftenwithoutthemrealisinghowprofitablethis
information is,and in thesecondcase,without themeven realizing that their information isbeing
collectedinanyway.
States also collect information about individuals in the country, and can often compel people to
providethemwithinformation,ratherthanhavingtopersuadethemtodosooroffersomethingin
return, as private companies would have to. This special status allows a state to gather large
amountsofinformationthatcanbepotentiallyminedbyprivateorpublicinstitutions.Thiscanthen
beusedtooptimiseinteractionsbetweenthestateandindividuals,toimprovetheservicesprovided
bythestate,andtocreatebusinessmodelsaroundthisdataortousethisknowledgeforthegreater
goodofthesociety.Atthesametime,alongwiththeaccesstopersonaldata,statesareresponsible
forprotectingtheprivacyofthedataproducerswhenminingthisdataandreleasingittothegeneral
public.
Culturalheritagedata
There has been enormous investment in the digitisation of heritage data, i.e. data that are
consideredworthwhilearchiving,studying,orexhibiting.In2010,itwasestimatedthatitwouldtake
EUR10billiontodigitiseEurope’sculturalheritageholdings.3By2012,itwasestimatedthatatleast
20%of Europe’s culturalheritage collectionshadbeendigitised10.At the same time librarieswere
investing in preserving digital collections and exploring ways to increase the accessibility of this
content. Libraries use TDM to structure digital content, and to develop and explore further
possibilities for knowledge discovery e.g. via topic modelling. Improvements in the accuracy of
10http://pro.europeana.eu/enumerate/statistics/results
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
13
optical character recognition (OCR), including for handwritten texts11will increase the impact and
efficiencyof applying TDM todigital content inorder to enhancemetadataor analyse collections.
Google Books12 and Google Art13 operate on a worldwide scale, and digitise cultural (printed)
heritage for theirown technologydevelopment,aswellas tomake thisdataaccessible toawider
audience.TheHathiTrustDigitalLibraryisapartnershipmodelbringingtogetherdigitisedcollections
fromacrossUSinstitutionsandmakingthemavailabletoUSresearchersforanalysis.14AtaEuropean
level, Europeana, which collects metadata about heterogeneous data such as digitised artworks,
artefacts,books,videosandsounds,providesaccesstothismetadataandlimiteddatasetsviaAPIs15.
The DARIAH16 and CLARIN17 infrastructures illustrate the value and potential of access to these
extremely heterogeneous corpora for the research community. Both infrastructures support TDM
acrossawide-rangingcorpora,e.g.CLARINsupportsnaturallanguageprocessingandanalysisoftext
(includingnewspapers,webfora,books),andspeech(newsbroadcasts,signlanguage).
Scientificdata
Scientific data consist of twomain types: data that have been collected through experiments and
studies(measurementsofanykind,texts,videos,etc.),andthescientificpublicationsthatdescribe
the work after analysis has been performed (containing data description, methods, experimental
results,etc.).
Data used in experiments may be released to a general audience, along with the publication
describingthestudy.Thisjointreleaseofdata,withcorrespondingpublicationandrelatedmetadata,
is considered to have a beneficial side-effect: it promotes scientific research reproducibility and
scientific integrity. An increasing number of researchers and international research organisations
perceivethistobeapositivestepforwards(ICSU,ISSC,TWAS,IAP,2015).18
Scientific data can be released purely as a data set collected for general purposes. In this way,
researchers in business or academia can access the data and decide themselves on the type of
experimentstorun.AnexampleofthisapproachistheopenaccessreleaseoftheEUsatellitedata
fromCopernicusEarthobservation19.The initiativetoaskthequestionsandtoanswerthemlies in
thehandsoftheaudience.OpendataanddatasharinginresearchisontheincreaseinEuropedue
to funder-mandates such as the H2020 Open Data Pilot, and the increasing availability of
infrastructurefordatasharinge.g.genericservicessuchasEUDAT20andZenodo21.Onaworldwide
11http://transcriptorium.eu12https://books.google.com13https://www.google.com/culturalinstitute/project/art-project
14https://sharc.hathitrust.org15http://labs.europeana.eu/api16https://de.dariah.eu/tatom17http://www.clarin.eu18 FORCE11 MANIFESTO: Improving Future Research Communication and e-Scholarship:https://www.force11.org/about/manifesto19http://www.eea.europa.eu/highlights/eu-satellite-data-to-be20http://www.eudat.eu
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
14
scale,anexampleistheResearchDataAlliance(RDA)22,aforumwheretheinfrastructuresfordata
sharingacrossdisciplines,institutions,andcountries,arediscussed,andguidelinesareformulated.In
addition, industryisalsobeginningtomaketheirscientificdatasetsavailable,e.g.clinicaltrialdata,
oftenforethicalortransparencyreasons.
Scientificpublicationscanalsobeconsideredasavaluabledataset forTDM.Thesetexts represent
potentialsourcesofnewknowledgewhentheinformationtheycontainissummarisedandanalysed
incomparisonwithotherstudiesandlinesofenquiryfromotherdisciplines.Oftencontainingfigures
anddatavisualisations,articlesareofvaluenot justbecauseof thetext theycontain,butbecause
they are important sources of raw data and context. TDM analysis of publications can help find
connections between studies in different domains that are implicitly or indirectly connected
(Swanson, 1986). In the field of bioinformatics, the literature has been mined to analyse the
interactions of genes andproteins in various diseases.Often this typeof analysis is performedon
Medline23abstracts,astheyarefreelyavailablebutithasbeennotedthatthisisextremelylimiting.
VanHaagenetal.foundthatonly32%ofknownProteinProteinInteractions(PPIs)couldbefoundin
Medline abstracts, and therefore the rest could only be found by accessing the full text literature
(vanHaagenetal.,2009).
Scientific publications that are released via openaccess are, bydefinition, accessible for TDMand
TDM-basedapplications.PubMedCentral24hasbecomeawidelyusedsourceforTDM.TheEuropean
open access publication infrastructure OpenAire25 not only supports TDM by encouraging storing
articles under open access licences and in interoperablemachine readable formats, it also utilises
TDM to infer links between data and publications, and to provide funders with data regarding
publications arising from research funding and research trends26. Subscription content is less
accessibleas,withtheexceptionoftheUK,Europeanresearcherscanonlyminethiscontentunder
conditions specified by licences. In contrast to this, in theUS, researchersmaymine content that
theyhavelegalaccessto,withouttheneedforadditionallicencepermissions.
Businessdata
Businessesarecreatingandcollectingdatafromallpossiblesources,asminingdatafromcombined
sources improves insights into internal processes, strategies, and themarket. These data can vary
fromenergy consumption by business agents, such as shops and carriers, to personal information
about customers and their digital and financial transactions. Most of this data constitutes value
withinthecontextofaknowledge-basedeconomy,andissensitivetodataprotectionissues,thus,is
usedinternallybycompaniesandisnotreleasedpublically.
21https://zenodo.org22https://rd-alliance.org23https://www.nlm.nih.gov/bsd/pmresources.html24http://www.ncbi.nlm.nih.gov/pmc25https://www.openaire.eu
26https://blogs.openaire.eu/?p=88
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
15
Publicsectorinformation(PSI)
PSI isdataproducedandmadeavailablebygovernmental institutions.Today,agrowingnumberof
governmentsrealisethatthisdatacanbringmoreprofitandbenefit tosocietywhenreleasedtoa
general audience. PSI data release initiatives allow commercial companies that are already
experienced in datamining, to use this additional information source to build better products for
theircustomers,consequentlycreatingmorejobsandboostingcommercialsuccess.
The European Union Open Data Portal27 gives access to a growing range of PSI data from the
institutions and other bodies of the European Union (EU); it is free for both commercial or non-
commercial purposes. The EU Open Data Portal is managed by the Publications Office of the
EuropeanUnion.ImplementationoftheEU'sopendatapolicyistheresponsibilityoftheDirectorate-
GeneralforCommunicationsNetworks,ContentandTechnologyoftheEuropeanCommission.
Opendataportalshavealsobeensetupby individualEUMemberStates.28Thedatasetsofpublic
documentsthattheseportalsshare,aremostlyunderCC-0 license, i.e. theownersdidnotreserve
anyrights,andthedataisfreeforsharinganduseinanycountry,byanyinstitutions.Thesedatasets
representthedocumentsinthenativelanguagesofeachspecificcountry.29,30,31,32,33
Internetdata
Sincetheintroductionofthefirstwebsitein1991,theWorldWideWebhadexpandedtomorethan
45 billion pageswith uniqueURLs in the late 2010s (Bosch et al., 2016).Manywebpages contain
valuable information that can be crawled for further mining. More generally, the Internet (the
infrastructureonwhichtheWorldWideWebisbased,butalsoservicessuchasinternet,email,and
filetransfer)allowsaccesstolargeamountsofdataofalltypes,thatcanbecrawledwiththeproper
credentials. Many data remains, of course, hidden behind passwords and firewalls. The so-called
DeepWeb, the part of theWorldWideWeb that is not directly accessible for indexing by search
engines,hasbeenestimatedtoholdatleast500timesasmuchdataastheindexed(‘surface’)web.34
2.1.2.Datatypes
AnytypeofdataconsideredtobeproperlyavailableforTDMmustfirstbeavailableas,orconverted
to,astandardisedandmachine-readableformsothatitcanbeeasilyprocessed.Intheir2014report
to the European Commission entitled 'Standardisation in the area of innovation and technological
development,notably in the fieldofTextandDataMining',Hargreavesetal. (2014) identified five
27https://open-data.europa.eu/en/data28http://ec.europa.eu/newsroom/dae/document.cfm?action=display&doc_id=8340 29https://data.gov.uk30https://data.overheid.nl31https://www.data.gouv.fr32https://www.govdata.de33http://data.norge.no34https://en.wikipedia.org/wiki/Deep_web
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
16
typesofdatabasesthatlendthemselvestoTDM,accordingtotheirsizeandaccesspolicies,aswellas
theirpotentialforrevenuestructurechanges:35
1.XXL:alldatabehindfirewalls,i.e.companiesandorganisations’internaldatabases,notaccessible
tothepublic.36
2.XL:allpubliclyaccessibledatanotbehinda firewallorpaywall.e.g. freelyaccessiblenewspaper
websites.
3. L: all publicly accessible data located behind a paywall. e.g. online newspaper pages that are
accessiblewithasubscriptionfee.
4 M: all publicly accessible data behind a paywall whose clients are mainly researchers and
companiesinneedofreports.e.g.Reuters,Bloomberg.
5.S:scientificpublishers’databehindapaywall.
Thisclassificationcoversdatacollectionsofdifferentsizesthatcontaincontentofadiversenature.
Wecanmakeadistinctionbetweenhigh-leveldatathatislikelytobesubjecttocopyrightprotection,
and low-level data that is highly unlikely to be protected under copyright law (although data
protectionlawmayapplyincertaincases):
- Protected:text,image,sound,multimedia;
- Non-protected,unlessthecontentbecomespartofthedatabasethatthecompanyor
institutiondidputsubstantialinvestmentinto:e.g.clicks,likes,GPScoordinates,timestamps,
andmeasurementsofadifferentkind.
2.2.TechnologyandR&D
Textanddatamininghasgraduallyevolvedintoalargedomainwithcomplexapproachesforarange
ofapplications.Historically,thedevelopmentofminingtechnologieshasitsrootsinthemid-1700s,
withtheinventionofmathematicalmodels,regressionmethods,andstatisticalanalysis.Duetothe
adventof computersand systems for storing largerquantitiesofdata, thepossibility tominevery
largedatasetsandfilterrelevantinformationhasgreatlyexpandedwhatusedtobelaboriousexpert
analysiswork. Since the beginning of the 1990s,modern datamining tools have appeared on the
market and in labs, allowing more advanced analysis, to discover hidden patterns and to extract
implicit knowledge,asenvisagedand testedbySwansonD.R. (Swanson,1986).With the invention
andadoptionoftheWorldWideWeb,researchersexpandedtheireffortstodeveloptextanddata
miningmethodstosearchandexploretheweb.This ledtothecreationofsearchengines,starting
from simple interface-based browsers like Mosaic and Netscape, and developing into large-scale
systems likeGoogleandBing. It represented theopeninggambit towardsmorecomplexmethods,
not only based on keyword search, but also on their correlation, their similarity, their joint
probabilities,andtotheapplicationoftextminingtodifferentfields(suchasthebiomedicaldomain,
35Theoriginaldocumentusestheterm‘database’;wereplacedthisby‘data’.36ThesecollectionsareoutsidetheFutureTDMscope.
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
17
economics, social sciences, or humanities), and types of data and content (such as scientific
publicationsandresearchdata,socialmedia,multimedia,orpublicsectorinformation).
Figure 4 shows a zoom-in view of the TDM environment, outlining the main research fields that
develop,test,andfinallyprovidethealgorithmsforTDMimplementationsforeachspecificdomain
of use. The core scientific areas aremachine learning,which ismachine learningof tasks and the
discoveryofpatternsandstructureindata;informationretrievalandsearch,whichfocusesontheoptimisation of search methods that index vast amounts of content and allow the location and
extraction of the most relevant facts for any question and user type; and natural languageprocessing techniques, which deal with human language in its textual and spoken form, covering
humanity’s past as well as reflecting current trends and topics. This scheme does not include an
exhaustive listof techniquesand researchdirections; itsaim is to illustrate thescaleandpotential
scientific interaction,and thebroadknowledge required tocreatea successful, scientifically sound
TDMapplication.
Below,webrieflyintroducethemainresearchdirectionsthatconstitutethecomponentsoftextand
dataminingapplications.
Figure4.MicroviewonTDMenvironment.37
Machine Learning (ML) studies the automated recognition of patterns in data, and develops
algorithmsabletolearntaskswithoutexplicit(human)instructionorprogramming.Learningisbased
37ThisFigure(‘MicroviewonTDMenvironment’)byFutureTDMcanbereusedundertheCC-BY4.0license.
Information Retrieval
and Search
Artificial Intelligence
Classification
Clustering
Regression
Association Rules
Statistics
Information Extraction Concept
Extraction
Opinion Mining
Summarisation
Negation and Modality
Detection
TextualEntailment
Natural LanguageProcessing
Machine Learning
Machine Translation
Predictive Analytics
Multimedia Processing
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
18
ondataexplorationandisaimedatgeneralizingknowledgeandpatternsbeyondtheexamplesthat
arefedtothealgorithmattrainingstage.
Informationretrieval(IR)canvaryfromsimpleretrievalofalltheitems(e.g.fulldocuments)withina
collection that contain user-requestedwords inmetadata fields or structured databases, tomore
sophisticatedsearchandrankingoftheresultsbasedonthepresenceofcombinationsofimportant
wordsandnumbersindifferentsectionsoftextsordatafields(Baeza-YatesandRibeiro-Neto,1999).
It is important to note that, although TDM uses information retrieval algorithms as one of the
componentswithintheTDMframework,itgoesbeyondthesimpleretrievalofwholedocumentsor
their parts. Within the IR framework, the task of knowledge processing and of new knowledge
creation still resideswith the user, while in the case of full TDM technology implementation, the
extractionofnovel,never-beforeencounteredinformationisthemaingoal(Hearst,1999).
Naturallanguageprocessing(NLP)aimsatdevelopingcomputationalmodelsfortheunderstanding
andgenerationofnaturallanguage.NLPmodelscapturethestructureoflanguagethroughpatterns
that map linguistic form to information and knowledge or vice versa; these patterns are either
inspiredbylinguistictheoryorautomaticallydiscoveredfromdata.TheNLPcommunityissubdivided
intogroupsthatfocusonspokenorwrittencontent,i.e.speechtechnologies(ST)andcomputational
linguistics (CL). ST researchers work with audio signals, synthesising and recognising speech and
paralinguisticphenomena.Oncespeechcontenthasbeentranscribedinatextualrepresentation,it
canbefurthertreatedastextforTDM.CLresearcherstargetsyntacticanddiscoursestructures;their
algorithmshelpusunderstand thebasic concepts, relations, facts,andeventsexpressed in textual
content.
ThevisualisationoftheoutputoftheML,IR,andNLPalgorithmsisanintrinsicpartofalltheTDM
constituenttechnologies.Visualisationisvaluableforresearchersinfieldslikebioinformaticswhere,
forexample, theminingofproteinbehaviourdescribed inasetofpublications isvisualised inone
scheme;orintextmining,whenawordcloud,withwordsindifferentsizeandcoloursdependingon
their frequencies, gives a general impression of the most frequent terms used in a collection of
documents.
Statistics, as a field of mathematics in general and machine learning in particular, provides its
practitionerswithawidevarietyoftoolsfordatacollection,analysis, interpretationorexplanation,
andpresentation.
Classification/Categorisation/Clustering of content within a collection in order to split (in asupervisedorunsupervisedmanner)theseintogroupsofcertaintypesbasedoncontent,representa
TDM preprocessing step at the collection level, as these categories are used for further specific
miningofthecollection’scontent.Machinelearning(ML)algorithmsrepresentadominantsolution
forthistask,e.g.theNaïveBayesapproach(Lewis,1998),decisionrules(Cohen,1995),andsupport
vectormachines(Joachims,1998).Whileclassificationalgorithmsidentifywhetheranobjectbelongs
toacertaingroup,regressionalgorithmsestimateorpredictaresponseasavaluefromacontinuous
set.
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
19
Association rule-learning algorithms allow for the discovery of interesting relations between
variables within large databases (Agrawal et al., 1993). For example, when a large database of
customer shoppingbehaviour ismined, association ruleshelp todefinewhat typesofproductsor
services are usually sold together. This information can help to adjust and tailor marketing
campaigns,aswellasmakeproductarrangementsinthestoresmoreefficient.
Informationextraction(IE) isawideconceptthatcoversthe inferenceof informationfromtextual
data.Examplesoftypesofextractedinformationarestatisticsontheusageofterms,theoccurrence
of named entities in text and their relations or associations according to certain classification
schemes,andthepresenceofspecificfactsorlogicalinferencesexpressedintexts(e.g.interactions
betweengenes,logicalproofs,ortheinferenceoflogicalconsequencesandentailmentexpressedin
thetext).
Term or Concept extraction is a first step for further IE, as it is necessary to define the terms of
importance for thecorpus to start itsprocessing.Once these terms,being specific for the topic to
conceptualiseagivenknowledgedomain,areextracted,anontologyofthefieldcanbecreatedfor
furtherIEtaskimplementations.
Negationandmodality(orhedge)detectionisanimportantaspectofTDM,asnegationandhedging
constructionsinthetext(suchas“OurresultscastdoubtonfindingXofAuthorZ”or“Wefoundthat
findingZdoesnotapplytocontextY”)offercrucialevaluationsofreportedfacts.
Sentimentanalysis,oropinionmining,aimstodeterminetheattitudeofaspeakerorawriterwith
respect tosometopicor theoverallcontextualpolarityofadocument.Theattitudemaybehisor
her judgmentorevaluation, theaffective stateof theauthor,or the intendedemotion theauthor
intendstoconveytothereader.
Summarisationofagiventextorasetoftextsisthetaskofcreatingashortertextthatcapturesthemost crucial information from the original texts. First, the crucial content to be summarised is
defined,andthen,dependingonthetask,itiscondensedintoanewrepresentationthatmayconsist
of sentences or parts from the original documents (extractive summarisation), or that may be a
newly created text ormultimedia document that rephrases the summarisedmessage (abstractivesummarisation).
Predictiveanalytics covers themethods that analyseprevioususageor transactions statistics and,
basedontheseresultsandtrends,offerspredictionsonfurthersystemsdevelopment.
Machine translation targets human quality level of translation ofwritten or spoken content from
onelanguagetoanotherusingcomputers.Althoughitoriginallystartedasrule-basedsystemstaking
linguistic knowledge into account, nowadays thepioneers in the fielduseBigData size collections
anddeeplearningalgorithmstotraintheirsystemsandtoachieveperformanceimprovements.
Multimediaprocessingallowsfurtherminingofallaspectsofmultimediacontent,varyingfromthe
audio stream of a given video to visual concepts extraction and annotation, person and object
discovery and localisation on the screen. This direction of research and applications has gained a
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
20
greatdealof importance,due to the fact that contemporary Internetusers shareandcreate large
quantities ofmultimedia content that, for example, reflect their opinions and spread information
aboutevents.
As depicted in Figure 4, the methods listed here are not insulated domains of research and
application;theyrepresentacomplexinteractivesetoftoolsandmethodsthatcanberecombinedin
multipleways,dependingontheneedsofthespecifictaskandtheavailabilityofdataandexpertise.
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
21
3.TDMACROSSECONOMICSECTORS
Since the beginning of humanity, the development of novel technologies has defined societal
structure and its employment distribution. Agriculture and natural resource extraction form a
primary sector, as this is an initial type of human joint-effort to produce food for survival. The
inventionofmachinesandelaboratetoolsledtothedevelopmentofthesecondarysector,whichhas
evolved intocomplexmachineryengineering,construction,andspaceexploration.Thethirdsector
comprises all the services that society members deliver to each other, varying from retail to
transportation, and from medical healthcare support to entertainment. Until recently, this three
componentstructurerepresentedmostofthesocietiesintheworld.However,thegrowthofdata-
driven business, services and research, has led to a discussion on the changes in the economic
structurethatshouldreflectanoveltypeofknowledge-basedeconomy,whereknowledgeanddata
become a separate valuable commodity (OCDE, 1996). Thus recently, the research that supports
knowledge sharing and growth, aswell as education of the professionals that can carry out these
activities,havebeentermedaquaternaryeconomicsector.Moreover,thehighleveldecisionmakers
in governments, large industry companies, and education, who have the potential to shape the
futureoftheentiresectorswiththeirvisionandfollowingdecisions,havebeenplacedinaseparate,
quinarysector.
Forthepurposeofourreport,weareinterestedinthecurrentandpotentialpresenceofTDMinall
economicsectors.WedepicttheserelationsinFigure5.Theprimary,secondary,andtertiarysectors
followroughlythesametypeofinteractionwithTDMtechnologies.Thesesectorsrepresentsources
of all possible types of data that are available for mining, and at the same time, have tasks and
challenges that require smart solutions. The quaternary sector is particularly central to TDM
development and implementation, as it produces the TDM experts and advances the technology
itself.Expertsatthedecision-maker leveluseTDMtoolstotestandprovetheirvisionofeconomic
development,aswellas tohypothesisewherea specificdecisionwill lead to,and tomeasureand
assessitsprogress.
In the sections below,we list examples per sector and subfield on how TDM is used, in order to
illustratethebroadnessofTDMuptakeanditspotentialfordevelopment.
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
22
Figure5.GeneraleconomicstructureandconnectionbetweenTDMandalleconomicsectors.38
3.1.Primarysector
Making the best use of limited natural resources with a constantly growing population is one of
humanity’s ultimate challenges. Taking on the challenge requires the understanding of changing
patternsinclimateandenvironment;TDMhasthepotentialtodiscoversuchpatternsautomatically
from data. The primary data that can be used for TDM are collected in the form of satellite
measurements,weathersensorsonthegroundandbelow, intheair,and inthewater.Otherdata
arederivedfromeconomic indicators fromtheprimarysectors:productionnumbers,profits,stock
marketvaluepatternsofminedgoodssuchasoil,coal,andgold,andcropssuchaswheat39.
3.2.Secondarysector
Anylarge-scalemanufacturingplantordistributednetworkofpipelinesispackedwithvarioustypes
of sensors that produce a continuous stream of measurements that are crucial for maintenance
analysis,suchastemperature,pressurelevels,leakdetections,etc.Whenanalysedonalargescale,
this information offers a basis for discovering and understanding strategies for cost savings and
safety measures. For example, calculating optimal routes for product delivery can boost fuel
efficiency,leadingtoasmallerecologicalfootprintaswellastocostssavings;knowledgeofelectrical
gridusageorgaspipelinesintegrityhelpstoschedulemaintenanceofthemostsensitivepartsofthe
38 This Figure (‘General economic structure and connection between TDM and all economic sectors’) byFutureTDMcanbereusedundertheCC-BY4.0license.39http://www.agroknow.com/agroknow
PRIMARY SECTORAgricultureCoal MiningNatural Resources IndustriesWholesale Trade
TERTIARY SECTOREntertainmentFinance, Banking, InsuranceHealth and Social CareIT ServicesPublic AdministrationRetail/TradeTelecommunications and JournalismTourismTransportation
QUARTENARY SECTOREducationResearch and Development (e.g. Tech)
TEXT AND DATA MINING
QUINARY SECTORHigh Level Decision Makers in Government, Industry, Commerce, and Education
SECONDARY SECTORAerospaceAgro-IndustryBio-TechnologyConstruction
EngineeringManufacturingMicroelectronics
Across Economic Sectors
Technology
Data &Tasks
Technology
Data &Tasks
Tools for Analysis
Industry or Gov. Experts
Data &TasksTechnology
Technology TDM Experts
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
23
system (Mayer-Schoenberger and Cukier, 2013), (Auclair D., 2016); data-driven smart grid
applicationscansignificantlylowerCO2emissions(ICSU,ISSC,TWAS,IAP,2015).
3.3.Tertiarysector
ThesectorofprovidedservicesembracesTDMatadifferentpacedependingontheTDMexpertise
andawarenessinthefield,BIgDataavailability,andthesensibilityofthisdataintermsofprivacy,as
muchmorepersonaldataisminedwithinthissector.
Companiesallocategreateffort to interactingwith theirexistingcustomers,as thisallows themto
monitorcustomerfeedbackontheproducts,andatthesametimetoprofilecustomersforfurther
targetedproductpromotion.Ontheotherhand,informationaboutusers’Internetsurfingbehaviour
isgatheredintheformofclicklogs,websiteviewsandsearchqueries.Oncethisdataiscategorised
and profiled, the advertising network industry can match the right advertisement from the right
company to the suitable client profile. Sentiment analysis of customer comments in global social
networks,suchasTwitterandinternetforums,allowscompaniestotrackcustomerinterestintheir
products,aswellastousethesechannelsfortargetedadvertisingofservicesandproducts.
3.3.1.Finance,Banking,Insurance
In the financedomain,predictiveanalyticsplayacrucial role.TDMbasedanalyseshelptomanage
risksandtomake informeddecisions.Banksand insurancecompaniesuseTDMtobuildconsumer
profiles anddevelop automatic classifiers todetermineeligibility for financial products, and to set
insurancepremiums.Customerparameterscan includediversetypesof informationgatheredfrom
different sources, ranging from consumers’ zip codes, employment background and educational
history,totheirshoppinghistory,socialmediausageandfriendshipnetwork(RamirezE.etal.,2016).
Financial institutions typically purchase this information from consumer reporting agencies (CRA)
thatcrawldatafromtheInternetandfromspecialgovernmentandprivatedatabases.
3.3.2.Healthandsocialcare
Both health care providers and consumers benefit when treatment plans andmedical advice are
tailoredtoindividualpatients.TDMhasgraduallybecomepartofthehealthcaresystem:inrouting
and treatment tailoring, treatment monitoring, as well as in claims processing and insurance
payments(AuclairD.,2016).Inaddition,BigDataisusedbydifferentmedicalinstitutionstopredict
the likelihood of hospital readmission for different categories of patients and types of disease
combinations.
MedicaltreatmentsandpredictionssuggestedbyartificialintelligenceandTDMarebeingintegrated
indecision support systems fordoctors (e.g.Babylon40 in theUK, IBMWatson41 in theUSA). Such
systems,poweredbyAI,canhelptolowerthecostsinregionswithashortageofspecialtyproviders.
40http://www.wired.co.uk/news/archive/2014-04/28/babylon-ali-parsa41http://www-03.ibm.com/press/us/en/presskit/27793.wss
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
24
Moreover, medical TDM systems can flag prescription anomalies based on patient records, thus
helpingtopreventmedicationerrors(AuclairD.,2016).
3.3.3.Publicadministration
Publicadministrationcollectdataaboutdifferentaspectsofthelivesofthemembersofsociety,such
asownershipofmovableandimmovableassets,useofthehealthcaresystem,employmentstatus,
and family situation.This information isused tokeep trackof societaldevelopmentsandchanges,
and to comeupwithadjustments topoliciesand reallocationof services.A substantial amountof
this informationstaysofflineandisnot interconnected. Ifthiscontentwouldbemadeopentothe
public throughopengovernment initiatives,manytypesofbusinessesandnon-profitorganisations
wouldhaveachance toput thisdata touse,e.g.byconnecting thePSIdata to theirowndata,or
building new products and services on top of the released PSI (Mayer-Schoenberger and Cukier,
2013).
Likethefinance industry,publicadministrationcanbenefit frominternalTDMimplementations,as
intelligent mining and automated auditing of the data can help to detect and to prevent fraud,
currently leading to financial losses in tax collection; TDM may generally help to optimise the
workflow.
3.3.4.Retail
Inretail,trackingandminingofpurchasetransactionsbycustomersallowscompaniestoadjustthe
rangeofproductsondisplayaccording to thepreferencesof thecustomers inaparticulararea. In
additiontobetterfittedproductselection,foodretailerscanadjusttheirfacilitiesarrangements,for
example,refrigeratortemperatures,basedonlarge-scalestatisticsofalltheshopsinachain,which
inturnhelpstosaveontheenergybills.
Somemarketplacesaresetupassalesplatformsforindividualsandcompaniesincompletelyvirtual
environments,likeinternationalgiantseBay42andAmazon43ortheFrenchsmallersizeequivalent,Le
BonCoin44.Inthecontextofonlineshopping,theminingofuserbehaviour(clicks,productsselected
orbought,timespentonthepage,browsersused,timeofthedaywhenthewebsiteisaccessed)is
increasingly being used for product recommendations and reminders, which eventually increases
sales.
3.3.5.Telecommunicationsandjournalism
Thetelecommunicationsindustrygeneratessomeofthelargestmultimediadatasets,andTDMhelps
to filter and to play already available content that is of interest to clients depending on their
preferences.Mining customer preferences, in turn, influences decisions on allocating resources to
thecreationofnewcontentthatwillpotentiallybesuccessful.
42http://www.ebay.com43https://www.amazon.com44http://www.leboncoin.fr
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
25
Journalists always needed diverse and reliable information sources to report news events and to
bring an analytical perspective to a general audience. Nowadays, journalists have to dealwith an
avalancheofdatasources,thusminingisseenasavaluabletoolforinvestigativejournalismtofind,
confirm,andeventuallytoprovestoriesinamoretransparentway(DiakopoulosN.,2014).Miningof
blogs,websitesandtwitter feedsmakes journalistsawareof theiraudience’s interestsand levelof
knowledge, which helps them to shape their social interaction accordingly and to select relevant
news pieces (Flaounas I. et al., 2013). At the same time, the journalists have to be careful when
redefiningtheiragendausingTDM.TheyneedtocumulatetheknowledgegainedfromTDMwiththe
actualstateofsociety,asnotallsocietymembersarerepresentedinthesocialmediaorwebdata
availableforcrawling.Intheirworkanduseofdata,journalistshavetofindabalancebetweendata
personalisationandtheecologyofcommonknowledge(LewisS.C.andWestlundO.,2014).
3.4.Quaternarysector
This information and knowledge-rich sector focuses on knowledge gathering, processing, and
creation, via research practices and teaching societymembers. In this sector, TDM algorithms are
boththesubjectofresearchandasetoftoolsthatenableresearch.
3.4.1.Research
The speed of scientific progress depends to a large extent on the framework that supports the
sharing of novel ideas and key findings within and across communities, the reproducibility and
comparabilityofresults,andtheresearchimpactmeasurements.TDMtechniqueshavethepotential
toaddvalueandincreaseefficiencyforallthesecomponents.
AsmentionedinSection1andillustratedwithexamplesinFigures1-2,researchersnowadayshave
tokeeptrackofanenormousnumberofpublicationstoretainanup-to-dateunderstandingofthe
research in their field. Moreover, a growing amount of research happens at the intersection of
severaldomains,ormethodsarebeingadoptedacrossdomains.Thus,thecurrentvolumesofcross-
domaincontent render them infeasible tobe readbyhumans.TDMcanhelpresearchersdiscover
newconnectionsandtrends,allowingthemtomoreoptimallyusetheirtimeandotherresources.
Thequalityofscientificresultssharedwithinthescientificcommunityreliesandheavilydependson
thesystematicreviewingofsubmittedarticlesforpublication.Thisrigorousprocessguaranteesthat
theresearchoutcome,whenaccepted,bringsnovel informationtothefieldofknowledge,andthe
results have been achieved following the standard procedures that safeguard reproducibility and
reliability. Ideally,highqualityreviewingreliesonfullawarenessofall thepublications inthefield.
However,withtheincreasingnumberofpublishedstudiesatagloballevel,itbecomesunfeasiblefor
anyhumanresearcher/reviewertokeeptrackofallstudiesthathavebeenreported.Therefore,TDM
canprovidevaluablesupportforreviewers,asitcanreducetheworkloadandminimiseanypotential
bias.
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
26
Furthermore,thequalityofpublishedresearchcanbemeasuredwithusingTDM.Forexample,smart
mining of publication references that takes into account the discourse structure in case of each
quotation,yieldsamoresubtleandinformativeoverviewofthewholecitationsnetwork.
Clinicalstudiesrequireameaningfulsamplesizeofpatientswithcertainconditionsinordertotest
theeffectivenessofnewdrugsormedicalprocedure.Pharmaceuticalcompaniesthatdevelopthese
drugsandtreatmentsrelyonintermediarycompaniesthatcarryouttheactualscreeningofmedical
recordsforsuitablepotentialcandidates.TheseintermediariesworkonalargescaleandrelyonTDM
technologiestoestablishpatient-clinicalmatches,aswellastosupportpost-trialpharmacovigilance
throughearlydetectionofadversedrugevents.
3.4.2.Education
Traditional educational institutions can use BIg Data techniques to identify students for advanced
classes,whileat thesametimedeterminestudentswhoareat riskofdroppingoutormightbe in
needofearlyinterventionstrategies.TheGatesFoundationhasinvestedinanumberofprogrammes
to improve the quality and availability of educational data with the aim of developing policy and
practicetoimprovetheimpactofeducation,andtobooststudentachievement45.
Massive open online courses (MOOC), introduced in the late 2000s, are currently reshaping the
wholeconceptofdistanceeducation,andhaveapotentialtoimpactthewholeeducationalsystem,
as theymake it possible tomonitor and assess individual class performance based on large-scale
miningofstudentprofiles,attendance,taskcompletion,andforumdiscussions.
3.5.Quinarysector
Funders and high-level decision makers rely on their general visions of technology and societal
development paths, as well as the impact-metrics that assess potential economic, political, and
societalvalue.TheseexpertscanuseTDMtominethecurrentstatusofaffairs,andtopredict the
outcomes of diverse types of interventions. Moreover, sentiment analysis of social networks
activitiescanprovidethemwithfastfeedbackfromthepopulationonactionstaken.
45 http://www.gatesfoundation.org/Media-Center/Press-Releases/2009/01/Foundation-Invests-in-Research-and-Data-Systems-to-Improve-Student-Achievement
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
27
4.SUMMARYOFTHELITERATURE
Astextanddataminingharbouralargepotentialacrosspracticallyallsectorsoftheeconomy,their
legal,economic,andpoliticalimplicationshavebeendiscussedinseveralexpertreportsandfromthe
pointofviewofvariousstakeholders.Thisreportisbasedonseveralofthesereports;inthissection
we summarise the most prominent reports published in recent years for further reading and
reference.Wehighlightthepointswhichcallforurgentlegalandeconomicchangesthatwouldallow
TDMtounleashitsfullpotential.Wealsohighlightcaseswhereuptakesuccessismentioned,andwe
outlinethedirection inwhichdiscussionsdevelop.Figure6depictsthegeneraltimelineofreports,
their research questions and focus (Legal aspects, Economic factors, Examples, Technology). It is
worthnotingthatreportscreatedonbehalfofgovernmentsandacademiaaremoreoftenreleased
via open access, even when produced by private companies. Therefore, they are available for
analysis,whileprivate sector reports, usually createdas aproduct for sale, staybehindapaywall,
restrictingouraccesstotheircontent.
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
28
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
29
Figure6.Timelineofreports.46
Year Title Authors Researchquestions/Focus
Dataandmethods Keyfindings
2011 DigitalOpportunity.AReviewofIntellectualPropertyandGrowth.47
I.Hargreaves ThisreportaimstofindeconomicevidencetosupportchangesintheIntellectualPropertylawsattheleveloftheUK,andfurtherataunifiedEUlevel.
Basedonanalysisofpreviouspublications,reports,reportedlegalandeconomicpractices.
TwoimportantareasofutilityofTDM:(1)costsavings,(2)widereconomicimpactgeneration
2011 BIgData:Thenextfrontierforinnovation,competition,andproductivity.48
McKinseyGlobalInstitute(MGI)J.Manyika,M.Chui,B.Brown,J.Bughin,R.Dobbs,C.Roxburgh,A.HungByers
ToexploreBigDatapotentialwhenusedineveryeconomysector,itsinnovationpotential,andtheaddedvalue.
Basedonanalysisofpreviouspublications,reports,reportedlegalandeconomicpractices,aswellasinternalMGIresearch.
ThisreportgivesageneralperspectiveontheBigDataTDMbeingincorporatedintoalleconomicsectors,asdataficationreachesallsectorsoftheglobaleconomy.Theexamplesofpotentialprofits,aswellasraisingissues,arelistedforbothEUandUSAmarkets.-Issues:datapoliciesinthecontextofBigData,privacyprotection;storageandaccessfacilitiesandnoveltechniquesthatneedtobeimplemented.-BigDataanditsprocessingcreateadditionalvalueforbusinessesinanumberofways:improvedandmoreusertargetedservicequality,overalltransparencyofperformance,andthesupportofhumandecisionswithreliableanalytics.
2012 TheValueandBenefits D.McDonald,U. Toexplorecosts, Consultationswithstakeholdersand -Theavailabilityofmaterialfortextminingislimited.Evenin
46ThisFigure(‘Timelineofreports’)byFutureTDMcanbereusedundertheCC-BY4.0license.47https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/32563/ipreview-finalreport.pdf48http://www.mckinsey.com/business-functions/business-technology/our-insights/big-data-the-next-frontier-for-innovation
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
30
ofTextMiningtoUKFurtherandHigherEducation.DigitalInfrastructure.JISC.49
Kelly benefits,barriersandrisksassociatedwithtextminingwithinUKfurtherandhighereducation
casestudies(literaturereviewinsystemsbiology,textminingtoexpediteresearch,increaseofaccessibilityandrelevanceofscholarlycontent.
caseofearlyadoptersinthefieldsofbiomedicalchemistryresearch,theminingislimitedtotheOpenAccessdocuments.-Mainbenefitsareincreasedresearchers’efficiency,improvedresearchprocessandquality.Thereportdemonstrateshowmuchmoneyintermsofpersontimecanbesavedorusedforproperresearchtaskswhenthescientificcontentisminedautomatically,leavingtheresearcherswithanalysischallengesandpureresearchquestions.-Barriers:legaluncertainty,inaccessibleinformation.
2012 BusinessintelligenceandAnalytics:FromBigDatatoBigImpact.50
MISQuarterly
ThemainfocusofthisrevueistodescribeanddiscussBusinessintelligenceandAnalytics(BI&A)implementedfordiversemarketapplicationsthatusemanydatatypes(fromuser-generatedcontenttoFederalReserveWireNetworkinformation)
-BibliometricstudyofcriticalBusinessintelligenceandAnalyticspublications
-BI&Ainitscurrentstate(2.0)processesweb-basedunstructureddata,andthenextstepforthetechnology(3.0)willbetousemobileandsensor-basedcontent.
2012 ResearchTrends.SpecialIssueonBigData.51
Thisspecialissuehasabroadperspectivevaryingfromapplicationsscenarios
-- -Bibliometricsanalysisconfirmsthat‘ComputerScience’and‘Engineering’aretheleadingtopicsinBigDatapublications,andUSleadsthepublicationsraceinthefield.
49https://www.jisc.ac.uk/reports/value-and-benefits-of-text-mining50http://hmchen.shidler.hawaii.edu/Chen_big_data_MISQ_2012.pdf51http://www.researchtrends.com/wp-content/uploads/2012/09/Research_Trends_Issue30.pdf
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
31
toinfrastructureissuesforTDMinBigDatacontext
2012 META-NET.StrategicResearchAgendaforMultilingualEurope2020.52
METATechnologyCouncil
Thisreportfocusesonthecurrentstateofandthefurtherpotentialfordevelopmentoflanguagetechnologiesonaglobalscale,andmorespecifically,theEuropeanlandscape.
-MultilingualismisaanimportantfeatureofEurope,asitmeansthatthereisneedforthediversificationandlocalisationofthetoolsthatwillbeabletoprovideservicesatthesamelevelforthewholevarietyofEuropeanlanguagesandcustomers.
2013 e-IRGWhitePaper.53 e-IRG ThisreportfocusesontheinfrastructureinitiativesthathavealreadybeentakenandespeciallyonthosethatareurgentlyneededinordertosupportTDMonalargescaleacrosscommunities.
-InfrastructureneedssupportatnationalandEUlevel,andrequirescollaborationwithbusiness.-Openscienceshouldbesupportedthroughandviasharedinfrastructures.
2014 Bridgingthegap.Howdigitalinnovationcandrivegrowthandcreate
P.Hofheinz,M.Mandel
-Thisdocumentoutlinesthegrowthofdataataglobalscale,
Basedonanalysisofpreviouspublications,reports,reportedlegalandeconomicpractices.
-LargerrevolutioninalleconomicsectorsandaspectsofsocietallifearepredictedtobeduetotheBigDataera,butmostimportantlythesewillbeshapedbydataanalyticsforthis
52http://www.meta-net.eu/vision/reports/meta-net-sra-version_1.0.pdf53http://e-irg.eu/documents/10920/11274/annex_5.2_e-irg_white_paper_2013_-_final_version.pdf
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
32
jobs.LisbonCouncilSpecialBriefing.54
andaspectsofdata-driveneconomylikebetterconditionsforcreatinganinclusiveeconomyforallsocietymembers,betterpotentialforcheaperindividualeducation,andtrainingforadvancedmanufacturing.-Theauthorsfocusonthe‘datagap’betweenEuropeandNorthAmerica.-StrongfocusisgiventothediscussiononpolicyregulationsthatneedtobeadjustedacrosstheAtlantic,bothintheEUandtheUSA.
data.-EuropeisbehindcountrieslikeSouthKorea,US,Canada,intermsofdatause.-Therearesuccessfulexamplesofopenaccesspolicyintroductionforpublicdata.InEstonia,thestatereleasedpubliclyowneddataincludingutilitiesandcellphonenetworksafterhavingstrippedprivateinformationandanonymization,whichhadpositiveimpactonthedevelopmentofdata-drivenbusinesses.-TheauthorsarguethatiftheEuropeandataprotectionstatutesarenotlightened,thecostofdatausagewillonlyrise,whichconsequentlywillwidenthegapbetweenEuropeandtheUSA,aswellastherestoftheworld.
2014 MappingTextandDataMininginAcademicandResearchCommunitiesinEurope.LisbonCouncilSpecialBriefing.55
S.Filippov -Thisreportcontinuestheabove-mentionedworkongatheringinformationontheimportanceofTDMfortheglobaleconomy
Basedonanalysisofpreviouspublications,reports,reportedlegalandeconomicpractices,aswellasinterviews,andbibliographyandpatentbasedanalysisonScienceDirectdatabase.
-Thedifferenceincopyrightaccessacrossthecountriesisreflectedbythenumberofpublicationsontextanddataminingsubjects.TheUSisleadingthefield,whiletheonlyEuropeancountryatthetopofthelististheUK,whichhasanexceptionforcontentminingforacademicpurposes.-ThenumberofpublicationsonTDMinEnglishisseveral
54http://www.progressivepolicy.org/wp-content/uploads/2014/04/LISBON_COUNCIL_PPI_Bridging_the_Data_Gap.pdf55http://www.lisboncouncil.net/publication/publication/109-mapping-text-and-data-mining-in-academic-and-research-communities-in-europe.html
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
33
development,andfocusesonthereasonswhyEUlagsbehindtheUSandsomeAsiancountriesinthisdomain,andhypothesiseshowthiscanchangeeitherpositivelyornegatively,dependingonthemeasurestaken.
ordershigherthaninanyotherlanguages,covering97%ofthepublications.-TDMispatentedasalgorithmsintheUSandotherjurisdictions,whileinEuropeTDMispatentedasbeingembeddedinthesoftware.Chinaisrisingstronglyinthismarketintermsofpatents.
2014 Standardisationintheareaofinnovationandtechnologicaldevelopment,notablyinthefieldofTextandDataMining.ReportfromtheExpertGroup,EuropeanCommission.56
ExpertGroupI.Hargreaves,L.Guibault,C.Handke,P.Valcke,B.Martens,R.Lynch,S.Filippov
-ThisreportliststhestakeholdertypesandthelegalandeconomicissuesthattheyencounterbothintheEUandintherestoftheworld.-TheexpertshypothesiseonhowthevaluechainmightbeadjustedwhenTDMisfullyincorporatedandlegallyallowed,arguingthatitshouldhaveamorepositiveimpactoverall.Thedataiscategorisedaccordingtobothsizeandlegalaccesspattern.
2014 AssessingtheeconomicimpactsofadaptingcertainlimitationsandexceptionstocopyrightandrelatedrightsintheEU.Analysisofspecificpolicyoptions.57
CharlesRiverAssociates(CRA)J.Boulanger,A.Carbonnel,R.DeConinck,G.Langus.
Thisreportfocusesoneconomicimpactsofspecificpolicyoptionsinseveraltopicsofinterest(digitalpreservationofandremoteaccessto
-IntroductionofIPexceptionsforthecaseofTDMforscienceshouldnotsignificantlyadverselyaffectrightholders’incentivesforcontentcreation,contentqualityandTDM-specificinvestments.
56http://ec.europa.eu/research/innovation-union/pdf/TDM-report_from_the_expert_group-042014.pdf57http://ec.europa.eu/internal_market/copyright/docs/studies/140623-limitations-economic-impacts-study_en.pdf
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
34
culturalheritage,e-lendingforlibraries,TDMforscientificpurposes),inviewofprovidingpolicyguidanceonthesetopics.
2014 Studyonthelegalframeworkoftextanddatamining(TDM).58
DeWolf&PartnersfortheEuropeanCommission
ThefocusisonthelegalaspectsofTDM.
Dataanalysisisthemostencompassingtermandismorepreferableforuseinalegalcontext.Classificationof4differentlevelsofaccess:“alltoall”(web-date)/“manytomany”(socialnetworks)/“onetomany”(contractualclauses)/“onetoone”(confidentialagreements)
2014 TheDataHarvest:Howsharingresearchdatacanyieldknowledge,jobsandgrowth.59
RDAEuropeF.Genova,H.Hanahoe,L.Laaskonen,C.Morais-Pires,P.Wittenburg,J.Wood
ThereportfocusesonthemeasuresthatshouldbetakeninordertosupportresearchusingBigData.
-Mainincentivesofthereportinclude:--dataplanmanagement;--promotionofliteracyintermsofalltheaspectsofBigDatauseacrosssociety;--developtoolsandpoliciesthatsupportdatasharing.
2015 OECDScience,TechnologyandIndustryScoreboard2015.INNOVATIONFOR
OrganisationforEconomicCo-operationandDevelopment,
-Thisreportaimstoprovidepolicymakersandanalystswiththemeanstocompare
-Thisisabroadlyscopedreportthathassectionsthatdiscusstrendsandfeaturesofknowledgeeconomies.-ItgivesperspectivesontheoverallscientificexcellenceofEUcountriescomparedtotherestoftheworldacrossallfieldsof
58http://ec.europa.eu/internal_market/copyright/docs/studies/1403_study2_en.pdf59https://rd-alliance.org/sites/default/files/attachment/TheDataHarvestFinal.pdf
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
35
GROWTHANDSOCIETY.60
Europe
economieswithothersofasimilarsizeorwithasimilarstructure,andtomonitorprogresstowardsdesirednationalorsupranationalpolicygoals.-Themainfocusofthisreportisontheknowledge-basedeconomies’developmenttrends.
science,andtheamountofinvestmentindifferentdomains,withspecialhighlightsonthenewgenerationofICT-related,BigDatabasedtechnologiesandpatents.-Acrossmanymetrics,EuropeancountrieswhencombinedarecomparabletotheresultsofUSA,Japan,China,SouthKorea.
2015 Skillsofthedatavores.TalentandtheDatarevolution.61
Nesta,UKJ.Mateos-Garcia,H.Bakhshi,G.Windsor
IntroducesitsownclassificationofthecompaniesthatworkwithBigData,andthenanalysesthetypesofskillsdatascientistsandengineershavetolearnandmastertoboosttheeconomyandscientificdevelopment.TheanalysisfocusesontheUKmarketwhilethecompany
InternalNESTAanalysisofthedataanalyticsmarketandhiringstrategiesandchallengesinthefieldthatarebasedonmorethan50in-depthinterviews(inthefieldsofmanufacturing,retail,ICT,creativemedia,financialservices,pharmaceuticals),collaborationwithCreativeSkillset.
-Classificationofcompaniesconnectedtodataprocessing:--‘Data-active’companiesthatareeitherdata-driven(Datavores),workwithlargevolumesofdata(DataBuilders),orcombinedatafromdifferentsources(DataMixers),--Dataphobes:othercompaniesthatrestrictthemselvestosmallsizedatasets.-Thesedifferenttypesofcompanieshavedifferentbehaviourswhilesettinguptheirrequirements,andthenhiringthedataexperts.Forexample,datavoresanddatabuildersrecruitmostoftheircandidatesfromComputerSciences,Engineering,andMathematics,whileBusinessdisciplinesarethesourceformanagementanalyticspositions.
60http://www.oecd.org/science/oecd-science-technology-and-industry-scoreboard-20725345.htm61https://www.nesta.org.uk/sites/default/files/skills_of_the_datavores.pdf
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
36
classificationcanbeusedglobally.
2015 BriefingPaper.TextandDataMiningandtheNeedforaScience-friendlyEUCopyrightReform.62
ScienceEurope Summarisestheargumentsinthedebateforcopyrightlawreformthatshouldensurethatlegally-accessedcontentcouldbefreelyminedwithoutadditionalpermissionandcostwithintheEU.
Basedonanalysisofpreviouspublications,reports,reportedlegalandeconomicpractices.
-EUregulationsfordatabaseprotection(Directive1996/9/EC)seriouslyimpederesearcherswhoaimtoaccessthecontenttobeminedviaadatabase,asdownloadingasubstantialpartofthedatabasecontentneedstobeauthorisedbythedatabaseowner,evenifnoneofthecontentisundercopyright.-Scientificpublishersstarttointroducelicencesregulatingtheminingofsubscribedcontent,whichfurtherlimitstheresearchersintheirimmediateinteractionwiththecontent.-ExceptionsincopyrightlawincountrieslikeUK,Japan,USA,whichallowtheirrepresentativestoavoiddifficultieswiththelegalityofcontentmining,whichputsEUresearchersatadisadvantage,andcomplicatespotentialforinternationalcollaborations.-Theauthorssuggestthefollowingstepsforresearchorganisations:--AvoidrequestingthattheirresearchersabstainfromTDM;--RefusetosignTDM-licensesaspartofsubscriptioncontracts,ornegotiatemorefavourablecondition;--EmpowertheiremployeesinternallythroughtrainingandTDMsupport,andexternallybyadvocatingcopyrightchangesatnationalandEuropeanlevel.
2015 IsEuropefallingbehindinDataMining?Copyright’sImpacton
C.Handke,L.Guibault,J.-J.Vallbe
Thispaperfocusesonthecopyrightarrangements
Theauthorsmeasureresearchoutputinthenumberofpublicationsinacademicjournal
DemonstratesthatresearchersfromtheEU/EEAcountrieshaveweakerperformanceintermsofpublicationsinthegrowingfieldofdatamining.Thisispartiallyduetothecomparatively
62http://www.scienceeurope.org/uploads/PublicDocumentsAndSpeeches/WGs_docs/SE_Briefing_Paper_textand_Data_web.pdf
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
37
DataMininginAcademicresearch.63
impactingdataminingbyacademicresearchers.
articlesinThomsonReuters’sWebofScience(WoS)database.
strongcopyrightprotectionlawsinthesecountries.
2015 ScienceInternational:OpenDatainaBigDataWorld.64
InternationalCouncilforScience(ISCU),InternationalScienceCouncil(ISSC),TheWorldAcademyofSciences(TWAS),InterAcademyPartnership(IAP)
-ThisreportfollowsanaccordofInternationalScientificorganisationsandfocusesontheopportunitiesthattheBigDatarealityrepresents.-Itdiscusseshowtextanddataminingofthisscientificcontentonamassscalecanhelpwhenthescienceaddressesglobalchallengessuchasinfectiousdisease,energydepletion,migration,sustainabilityandtheoperationoftheglobaleconomyingeneral.
Basedonanalysisofpreviouspublications,reports,reportedlegalandeconomicpractices.
-Unprecedentedopportunitiesforscientificdevelopmentincurrentdigitalagerelyon2pillars:BigData(largevolumesofcomplexanddiversedatastreaminginrealtime)andLinkedData(separatedatasetslogicallyrelatedtoaparticularphenomenonallowcomputerstoinferdeeperrelationships).-Forthescientists,theopenaccesstothescientificdataandpublicationsisamuch-neededimperative.Therefore,authorsinsistonderegulatingopendataforscientificdatausage.Datashouldbe“intelligentlyopen”,i.e.discoverableontheWeb,accessibleandintelligibleforfurtheruse,andassessableintermsofdataproducers’quality.Thisopennesswillalsohelpreplicabilityteststhattestandscrutinisethereportedresults.-Theauthorsnotetheongoingdiscussionwhetherpublicdatashouldbefreelyavailabletoeveryone,orjusttothenon-profitsector.Theyarguethatitisprimarilynotappropriateorproductivetointroducethisdiscriminationindataaccess,andmoreimportantly,robustevidenceisaccumulatingtosupporttheclaimthatthisaccesstodataforeveryonebringsdiversebenefits,andbroadereconomicandsocietalvalue.-Theauthorsgiveexamplesofcross-countrycollaborationsinopenaccessinitiativesinSouthAmericaandAfrica,whilenotingthatsharingdataallowsmoreuniformdevelopmentoftechnologiesandtheirimplementationonaglobalscale.
63http://elpub.architexturez.net/system/files/15_HANDKE_Elpub2015_Paper_23.pdf64http://www.icsu.org/science-international/accord/open-data-in-a-big-data-world-short
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
38
2016 TextandDataMining:TechnologiesUnderConstruction.65
Outsellmarketperformancereport,USAD.Auclair
-FocusofthisreportisonTDMaspectsforscientificresearch,healthcare,andengineering.
Theauthorsbasethereportonapproximately20in-depthinterviewswitharangeofstakeholderslikecontentvendorsandtechnologyandtoolproviders,aswellasonpreviouslypublishedinformation.
-Theauthorshighlightthefactthatdatamininghasbeenusedlongerformarketresearchandcommercialactivities,ascorrespondingcommercialwebsitescontainlargequantitiesofstructureddata,whileacademiaandindustryhavetodealwithhighvolumeofunstructuredtextsanddata,whichmadetheirprogressinuptakeslower.-ThereportprovidesalistofprovidersofTDM-relatedfunctionsandexamplesoftypesofbusinessthattheyareserving.-TheauthorslistanumberofactionsforcontentprovidersandTDMtoolvendorstofollow,thatcouldhelpthemtobenefitfromTDMuptakeinthenearfuture:--TDMsolutionsshouldbescalable--Storeddatashouldfollowthestandardsatcollectionandstoragestages.--Privacyandsecurityregulationsaretobeensured.
2016 BigData.AToolforInclusionorExclusion?UnderstandingtheIssues.66
FederalTradeCommission(FTC)Report,USAE.Ramirez,J.Brill,M.K.Ohlhaussen,T.McSweeny
-ThisreportarguesthatitisnotaquestionofwhethercompaniesshoulduseBigDataanalytics,buthowtoitthatcorrectly,minimisinglegalandethicalrisks.-HighlightsneedtoidentifyandavoidpotentialbiasesandinaccuraciesinTDM
FTCheldapublicone-dayworkshop‘BigData:AToolforInclusionorExclusion?’InordertogetafeedbackfrominvolvedstakeholdersonthepotentialofBigDataintermsofcreatedopportunitiesforconsumersandpolicyissues,namelysecuritybreaches,andbiasesandinaccuraciesinproperpopulation/eventsrepresentationinthedata.
-ThereportincludesalistofquestionsthataretobeconsideredandinvestigatedwhencompaniesareusingBigDataanalyticsintheirbusinessmodels.Thisprocedurecouldhelpthemtoavoidviolatinglawsintermsofdiscrimination,andtopreventerrorsthatmightoccurduetotrainingdatasetinconsistenciesthatcouldleadtopotentialbiasesinoffersandservices.Listofquestions:-Howrepresentativeisyourdataset?-Doesyourdatamodelaccountforbiases?-HowaccurateareyourpredictionsbasedonBigData?-DoesyourrelianceonBigDataraiseethicalorfairness
65http://www.copyright.com/wp-content/uploads/2016/01/Outsell-Market-Performance-22jan2016-Data-and-Text-Mining-Licensed.pdf66https://www.ftc.gov/system/files/documents/reports/big-data-tool-inclusion-or-exclusion-understanding-issues/160106big-data-rpt.pdf
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
39
applications,sothattheBigDataisusedtocreatenovelopportunitiesforsocietalimprovements,andespeciallytosupportlow-incomeandunderservedcommunities.
concerns?
2016 EuropeanParliamentresolutionon‘Towardsathrivingdata-driveneconomy’.67
EUParliamentJ.BuzekonbehalfoftheCommitteeonIndustry,ResearchandEnergy
Thisdocumentaimstostructuretheknowledge-basedeconomylandscape,withafocusontheEU,andtooutlinewhatareurgentmeasuresneedtobetakeninordertobetheleadersinthefieldataglobalscale.
Basedonanalysisofpreviouspublications,reports,andinternallyproducedfinancialestimationsofthemarketpotential.
-ThisresolutionoutlinesBigDataanddata-driveneconomypotentialtill2020,withafocusontheEuropeanissuesandneeds.-AnnouncementoftheEuropean‘FreeFlowofData’initiative-StressestheneedforamodernICTinfrastructureataEuropeanscale.
67http://www.europarl.europa.eu/sides/getDoc.do?type=MOTION&reference=B8-2016-0308&language=EN
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
40
5.CONCLUSION
Theuptakeoftextanddataminingisgrowingacrossallsectorsofknowledge-basedeconomies,asthestrengthsofTDManditseconomicadvantages,suchastimesavingandscaling,areincreasinglywellrecognised(ICSU,ISSC,TWAS,IAP,2015),(AuclairD.,2016).WhilethedevelopmentofTDMbythe larger IT companies is directed at global markets, there is a long tail of TDM research anddevelopment in industry aimedat specificmarkets andniches.MostTDM industriesmakesuseofprovenTDMtechnologies,suchassupervisedmachine-learningalgorithmsforclassificationorrisksassessment(Ittooetal.,2016),(FleurenandAlkema,2015),howeverthereisahealthyrelationshipbetween industry and academic research on the development and uptake of newmethods fromfieldslikeArtificialIntelligence(e.g.deeplearningalgorithms).
Ingeneral,industrieshavenotroublefindingaccesstodata,andmanagetouseTDMforeconomicprofit.However,different typesofbarriersstilloccur. Ifbarriers to theavailabilityof full textsanddatasetsfromacademicdomainsandpublicsectorinformationwerelifted,thefirstapplicationsthatcouldbemadepossiblewouldfinallyreviveoldvisionsfromthesecondhalfofthe20thcenturyonaccomplishing automatic unconstrained scientific discovery from observational data (Simon et al.,1981),(Simonetal.,2007).
Large-scaleglobalcompanieslikeAlphabet(Google), IBM,andMicrosoft,currently investheavily inTDM algorithms and infrastructure development, as much of the value they accrue nowadays isleveragedfromthedataprovidedtothembytheircustomers,incombinationwithallthecrawlablecontent on the web. For companies and research institutes based in the European Union, acompetitivestrategynecessarilyinvolvesawideruptakeofTDM,basedonabetterawarenessofthepotentialofTDMacrossalleconomicsectors.
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
41
REFERENCES
Agrawal,R., Imielinski,T.,Swami,A.,1993.Miningassociationrulesbetweensetsof items in largedatabases, in: Proceedings of the 1993 ACM SIGMOD International Conference onManagementofData.ACM,Washington,D.C.,USA,pp.207–216.
Auclair D., 2016. Text and Data Mining: Technologies Under Construction. (Outsell marketperformancereport).
Baeza-Yates, R.A., Ribeiro-Neto, B., 1999.Modern InformationRetrieval. Addison-Wesley LongmanPublishingCo.,Inc.,Boston,MA,USA.
Cohen,W.W.,1995. Learning to classifyEnglish textwith ILPmethods.Advances in inductive logicprogramming32,124–143.
DiakopoulosN.,2014.Algorithmicaccountability.Journalistic investigationofcomputatiionalpowerstructures.DigitalJournalism3,398–415.
Filippov,S.,2014.MappingTextandDataMininginAcademicandResearchCommunitiesinEurope.FlaounasI.,AliO.,Landsdall-WelfareT.,DeBieT.,MosdellN.,LewisJ.,CristianiniN.,2013.Research
methodsintheageofdigitaljournalism.Massive-scaleautomatedanalysisofnews-content-topics,style,gender.DigitalJournalism1,102–116.
Fleuren,W.W.M.,Alkema,W.,2015.Applicationoftextmininginthebiomedicaldomain.Methods74,97–106.doi:10.1016/j.ymeth.2015.01.015
Hargreaves I., Guibault L., Handke C., Valcke P., Martens B., Lynch R., Filippov S., 2014.Standardisationintheareaofinnovationandtechnologicaldevelopment,notablyinthefieldofTextandDataMining.ReportfromtheExpertGroup.
Hearst,M.A., 1999.Untangling TextDataMining. CollegePark,Maryland, Proceedingsof the37thAnnual Meeting of the Association for Computational Linguistics on ComputationalLinguistics(ACL’99)3–10.
ICSU, ISSC, TWAS, IAP, 2015. Science International: Open Data in a Big DataWorld. InternationalCouncil for Science, International Science Council, The World Academy of Sciences,InterAcademyPartnership.
Ittoo, A.,Nguyen, L.M., van denBosch, A., 2016. Text analytics in industry: Challenges, desiderataandtrends.ComputersinIndustry.doi:10.1016/j.compind.2015.12.001
Joachims,T.,1998.TextCategorizationwithSuportVectorMachines:LearningwithManyRelevantFeatures, in:Proceedingsof the10thEuropeanConferenceonMachineLearning.Springer-Verlag,pp.137–142.
Lewis,D.D.,1998.Naive(Bayes)atForty:TheIndependenceAssumptioninInformationRetrieval,in:Proceedingsofthe10thEuropeanConferenceonMachineLearning.Springer-Verlag,pp.4–15.
Lewis S.C., Westlund O., 2014. Big data and journalism: Epistemology, expertise, economics, andethics.DigitalJournalism3,447–466.
Mayer-Schoenberger,V.,Cukier,K.,2013.BigData:ARevolutionThatWillTransformHowWeLive,WorkandThink.JohnMurrayPublishers.
MotionforaresolutionfurthertoQuestionforOralAnswerB8-0116/2016pursuanttoRule128(5)oftheRulesofProcedureon“Towardsathrivingdata-driveneconomy”(2015/2612(RSP),2016.
OCDE, 1996. The knowledge-based economy. Organisation for Economic Co-operation andDevelopment,Paris.
RamirezE.,BrillJ.,OhlhaussenM.K.,McSweenyT.,2016.BigData.AToolforInclusionorExclusion?UnderstandingtheIssues.(FederalTradeCommission(FTC)Report).
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
42
Simon,H.A., Langley,P.W.,Bradshaw,G.L., 1981. Scientificdiscoveryasproblemsolving. Synthese47,1–27.doi:10.1007/BF01064262
Simon,Uren,V.,Li,G.,Sereno,B.,Mancini,C.,2007.Modelingnaturalisticargumentationinresearchliteratures:Representationandinteractiondesignissues:ResearchArticles.Int.J.Intell.Syst.22,17–47.doi:10.1002/int.v22:1
Swanson, D.R., 1986. Fish oil, Raynaud’s syndrome, and undiscovered public knowledge. PerspectBiolMed30.
vanHaagen,H.H.H.B.M.,’tHoen,P.,Bovo,A.B.,deMorrée,A.,vanMulligen,E.,Chichester,C.,Kors,J.,denDunnen, J.,vanOmmen,G.J.B.,vanderMaarel,S.,Kern,V.M.,Mons,B.,Schuemie,M., 2009.Novel protein-protein interactions inferred from literature context. PLoSONE 4.doi:10.1371/journal.pone.0007894