Extracting Linked Data from statistic spreadsheets
Tien-Duc Cao, Ioana Manolescu, Xavier Tannier
Semantic Big Data workshop, Chicago, May 19th, 2017
Agenda
1. Context: data journalism and journalistic fact-checking
2. Research problem: extracting linked open data from spreadsheets
3. Approach
4. Results
5. Future work
Tien-Duc CAO, Ioana Manolescu, Xavier Tannier, "Extracting linked data from statistic spreadsheets", 19/05/2017
1. Fact-checking is a content management problem
Claim to be checked (text or data)
Media content
Media context
Reference information source 1
Reference information source 2
…
Reference information source n
Human actors (journalists, experts, crowd workers)
Verification tool (query, match, source search…)
Analysis result: « True / rather true / rather false / false. See sources: http://dataref.com… »
Claim extraction
Social network analysis
Reconciliation, reputation
Reference information source n+1
Source search / source selection
Reference source construction, refinement, integration
1. Context
• Which data source can help us to fact-check a statistical claim from the media?
• E.g.: “The unemployment rate in France last year was 50%?”
• This work is a part of the ContentCheck¹ project
¹ https://team.inria.fr/cedar/contentcheck/
2. Research problem: high-quality reference data
• National statistic institutes such as INSEE¹, France’s economic and societal statistics institute, are often valuable data providers
¹ https://insee.fr/
[Figure: chart with the series existing house price index, available revenue per head, rent index, consumer price index]
Figure source: http://abonnes.lemonde.fr/les-decodeurs/portfolio/2017/04/18/les-fractures-francaises-1-5-le-logement-les-raisons-de-la-crise_5112859_4355770.html
2. The road to high quality data…
Unfortunately, most of the data published by INSEE looks like this (our text coloring):
2. The road to high quality data…
Sometimes there is more than one table per sheet
3. Extraction approach
Image sources:
https://www.iconfinder.com/icons/7661/excel_microsoft_word_xls_icon#size=128
https://www.w3.org/RDF/icons/rdf_w3c.svg
3. Approach: finding table boundaries
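The slide for this step is a figure, so the boundary-detection rule itself is not stated in the text. As a rough illustration only, the sketch below assumes that tables on a sheet are separated by at least one fully blank row; the function names and the in-memory sheet representation (a list of rows, with None for an empty cell) are hypothetical and not taken from the authors' extractor.

```python
from typing import List, Optional, Tuple, Union

Cell = Optional[Union[str, int, float]]


def is_blank(row: List[Cell]) -> bool:
    """A row is blank when every cell is None or whitespace-only text."""
    return all(c is None or (isinstance(c, str) and not c.strip()) for c in row)


def find_table_row_ranges(sheet: List[List[Cell]]) -> List[Tuple[int, int]]:
    """Return (start, end) row ranges, end exclusive, one per run of non-blank rows.

    Assumption: a blank row always separates two tables.
    """
    ranges = []
    start = None
    for i, row in enumerate(sheet):
        if is_blank(row):
            if start is not None:
                ranges.append((start, i))
                start = None
        elif start is None:
            start = i
    if start is not None:
        ranges.append((start, len(sheet)))
    return ranges


# Toy sheet with two tables separated by one blank row.
sheet = [
    ["Region", "2015", "2016"],
    ["North", 10.0, 11.0],
    [None, None, None],
    ["Age", "Rate"],
    ["15-24", 24.0],
]
print(find_table_row_ranges(sheet))  # [(0, 2), (3, 5)]
```

A real sheet may also hold tables side by side, which would need the same scan over columns; that case is left out here.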
3. Approach: table extractor
• Header cells mostly contain text
• Their positions are at:
  • the top (header rows) of the table
  • the left (header columns) of the table
• Having more than one header row/column indicates data aggregation
• Data cells mostly contain numeric values
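The "mostly text" versus "mostly numeric" observation can be illustrated with a small sketch. The missing-value markers and the strict-majority threshold below are assumptions made for the example; the slides do not say which markers or thresholds the extractor uses.

```python
from typing import List, Optional, Union

Value = Optional[Union[str, int, float]]

# Hypothetical markers for "value not available"; the real list is an assumption.
MISSING_MARKERS = {"nd", "n.d.", "-", "..."}


def cell_kind(value: Value) -> str:
    """Map a raw cell value to 'empty', 'number', 'missing' or 'text'."""
    if value is None:
        return "empty"
    if isinstance(value, (int, float)):
        return "number"
    text = str(value).strip()
    if not text:
        return "empty"
    if text.lower() in MISSING_MARKERS:
        return "missing"
    return "text"


def mostly(cells: List[Value], kind: str) -> bool:
    """True when a strict majority of the non-empty cells have the given kind."""
    filled = [k for k in (cell_kind(c) for c in cells) if k != "empty"]
    return bool(filled) and sum(k == kind for k in filled) * 2 > len(filled)


header_row = ["Region", "2015", "2016"]  # labels stored as text
data_column = [10.0, 11.5, "nd", 9.8]    # numbers plus one missing marker

print(mostly(header_row, "text"))     # True
print(mostly(data_column, "number"))  # True
print(mostly(data_column, "text"))    # False
```

Note that a year label stored as a number rather than as text would count as numeric here, which is one reason cell types alone are not enough to separate headers from data.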
3. Approach: table extractor
1. We distinguish header and data rows/columns using:
  • the data types of their cells (text, number, special value indicating a missing value, null for an empty cell)
  • the formatting information of their cells: cell borders, whether cells belong to a merged cell
  • the types of their neighboring rows/columns
2. Based on these, we identify the exact structure of each table
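One possible way to combine the three signals above is a two-pass labeling of rows: a first pass that looks only at a row's own cells (types and merged-cell membership), and a second pass that resolves ambiguous rows from their neighbors. The sketch below is an illustration under assumptions, not the authors' algorithm: the CellInfo type, the thresholds, and the rule that a merged cell implies a header row are all hypothetical, and cell borders are not modeled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CellInfo:
    kind: str             # "text", "number", "missing" or "empty"
    merged: bool = False  # True when the cell belongs to a merged range


def first_pass(row: List[CellInfo]) -> str:
    """Label a row 'header', 'data' or 'unknown' from its own cells only."""
    filled = [c for c in row if c.kind != "empty"]
    if not filled:
        return "unknown"
    if any(c.merged for c in filled):
        return "header"  # assumption: merged cells only occur in headers
    numeric = sum(c.kind in ("number", "missing") for c in filled)
    if numeric * 2 >= len(filled):
        return "data"    # at least half numeric; a leading text label is allowed
    if numeric == 0:
        return "header"
    return "unknown"


def classify_rows(rows: List[List[CellInfo]]) -> List[str]:
    """Second pass: an ambiguous row takes the label of the row above, else below."""
    labels = [first_pass(r) for r in rows]
    for i, label in enumerate(labels):
        if label != "unknown":
            continue
        above = labels[i - 1] if i > 0 else "unknown"
        below = labels[i + 1] if i + 1 < len(labels) else "unknown"
        labels[i] = above if above != "unknown" else below
    return labels


T, N, E = "text", "number", "empty"
rows = [
    [CellInfo(T, merged=True), CellInfo(E), CellInfo(E)],  # merged title cell
    [CellInfo(T), CellInfo(T), CellInfo(T)],               # column labels
    [CellInfo(T), CellInfo(N), CellInfo(N)],               # label + numbers
    [CellInfo(T), CellInfo(T), CellInfo(N)],               # ambiguous on its own
]
print(classify_rows(rows))  # ['header', 'header', 'data', 'data']
```

The same procedure applies to columns by transposing the table.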
3. Conceptual data model
4. Results
• Collected 16011 Excel spreadsheets, extracted 74117 tables.
• Accuracy evaluation:
  • We randomly selected 100 Excel files → 2432 tables
  • We visually identified the header cells, data cells and header hierarchy, and then compared them with those obtained from our system.
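The comparison described above can be sketched as a table-level exact-match score. The slide does not state which metric was used, so the definition below (a table counts as correct only if header cells, data cells and header hierarchy all match the manual annotation) is an assumption, and the toy values are illustrative only, not the paper's results.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Tuple

Coord = Tuple[int, int]  # (row, column) of a cell


@dataclass(frozen=True)
class TableStructure:
    header_cells: FrozenSet[Coord]
    data_cells: FrozenSet[Coord]
    hierarchy: FrozenSet[Tuple[Coord, Coord]]  # (child header, parent header) pairs


def table_accuracy(gold: Dict[str, TableStructure],
                   predicted: Dict[str, TableStructure]) -> float:
    """Fraction of annotated tables whose extracted structure matches exactly."""
    if not gold:
        raise ValueError("no annotated tables")
    correct = sum(1 for table_id, g in gold.items() if predicted.get(table_id) == g)
    return correct / len(gold)


# Toy annotation for two tables (illustrative values only).
t1 = TableStructure(frozenset({(0, 0), (0, 1)}), frozenset({(1, 1)}), frozenset())
t2 = TableStructure(frozenset({(0, 0), (1, 0)}), frozenset({(1, 1)}),
                    frozenset({((1, 0), (0, 0))}))
t2_wrong = TableStructure(frozenset({(0, 0)}), frozenset({(1, 0), (1, 1)}), frozenset())

gold = {"t1": t1, "t2": t2}
predicted = {"t1": t1, "t2": t2_wrong}
print(table_accuracy(gold, predicted))  # 0.5
```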
4. Sample extracted RDF
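The sample on this slide is a figure that is not reproduced in the transcript. To give a rough idea of what turning one data cell into linked data can look like, the sketch below serializes a cell and its row and column headers as N-Triples. The http://example.org/sheet/ namespace and the value, rowHeader and colHeader properties are hypothetical placeholders, not the vocabulary or data model used by the authors.

```python
from typing import List

EX = "http://example.org/sheet/"  # hypothetical namespace, for illustration only
XSD_DECIMAL = "http://www.w3.org/2001/XMLSchema#decimal"


def escape_literal(text: str) -> str:
    """Escape backslashes and double quotes for an N-Triples string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def cell_to_ntriples(table_id: str, row: int, col: int, value: float,
                     row_header: str, col_header: str) -> List[str]:
    """Describe one data cell with three triples: its value and its two headers."""
    cell = f"<{EX}{table_id}/cell/{row}-{col}>"
    return [
        f'{cell} <{EX}value> "{value}"^^<{XSD_DECIMAL}> .',
        f'{cell} <{EX}rowHeader> "{escape_literal(row_header)}" .',
        f'{cell} <{EX}colHeader> "{escape_literal(col_header)}" .',
    ]


for line in cell_to_ntriples("t1", 2, 1, 10.5, "North", "2016"):
    print(line)
```

In a fuller model the headers would themselves be resources, so that a header hierarchy (an aggregate header and its children) can be expressed as links between them rather than as plain strings.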
5. Future work
Reference information source 1
Reference information source 2
Reference information source n
Verification tool (query, match, source search…)
Source search / source selection
Reference source construction, refinement, integration
Thanks / questions?
Excel files and extracted RDF files (10.5 GB, will expire on May 29th, 2017): https://goo.gl/4Y5Dtv
Source code (no expiration date :)):
https://gitlab.inria.fr/cedar/insee-crawler
https://gitlab.inria.fr/cedar/excel-extractor