Konrad cedem praesi

Post on 07-May-2015


Assessment and Visualization of Metadata Quality for Open Government Data

Konrad Johannes Reiche*, Edzard Höfig, Ina Schieferdecker**, presented by Nikolay Tcholtchev**

* konrad.reiche@gmail.com
** {firstname.lastname}@fokus.fraunhofer.de

“A piece of content or data is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and/or share-alike.”

O·pen Da·ta /ˈəʊp(ə)n ˈdeɪtə/


[Diagram: DOMAIN. Labels: Government, Data, Citizens, License.]

[Diagram: DESIGN. Labels: Repositories; Metadata (XML, JSON, RDF); Resources (PDF, XLS, CSV, DOC).]

Quality. What could possibly go wrong?

Metadata Record

Name: regional-household-income
ID: 98899446-0a1a-43bc-874c-2d54dc700670
Maintainer: Margaret Jarmon
Maintainer Email: magaret.jarmon@cabinet-office.x.gsi.gov.uk
Author: Office for National Statistics
Author Email: webmaster@cabinet-office.x.gsi.gov.uk
License ID: uk-ogl

Resources
- URL: http://www.ons.gov.uk/ons/rhi13
  Description: Spring 2013
  Format: CSV
- URL: http://www.ons.gov.uk/ons/rhi14
  Description: Spring 2014
  Format: CSV

Quality. What could possibly go wrong?

Metadata Record

Name: regional-household-income
ID: 98899446-0a1a-43bc-874c-2d54dc700670
Maintainer: (empty)
Maintainer Email: (empty)
Author: Office for National Statistics
Author Email: (empty)
License ID: uk-ogl

Resources
- URL: http://www.ons.gov.uk/ons/rhi13
  Description: Spring 2013
  Format: CSV
- URL: http://www.ons.gov.uk/ons/rhi14
  Description: (empty)
  Format: CSV

[Slide build: the same record as above, now shown with the format labels CSV and HTML.]


Quality. What could possibly go wrong?

Metadata Record

Name: (empty)
ID: 98899446-0a1a-43bc-874c-2d54dc700670
Maintainer: (empty)
Maintainer Email: (empty)
Author: (empty)
Author Email: (empty)
License ID: uk-ogl

Resources
- URL: http://www.ons.gov.uk/ons/rhi13
  Description: Spring 2013
  Format: CSV
- URL: http://www.ons.gov.uk/ons/rhi14
  Description: (empty)
  Format: CSV

QUALITY LOSS

Information Loss. Reputation Loss.

- Missing Fields
- Dead Links
- Inaccurate Information
- False Information
- Outdated Values
- Missing Information
- Bad Spelling
- Non-Schema Compliant

Bad Searchability. Unreliable. Untrustworthy.

Meta·da·ta Qual·i·ty /ˈmɛtədeɪtə kwɒlɪti/

The fitness to describe the data (resources), supporting the task dimensions of finding, identifying, selecting and eventually obtaining the resources. The quality is inversely proportional to the uncertainty of the user about the actual data.

Assessing Metadata Quality is HARD

Highly Subjective

[Diagram labels: Metadata, Resource, "?", Wrong]

1. Manual
Qualified Process. Principles + Guidelines.
Postulated as being not feasible anymore due to the large number of metadata records.

2. Automated
- Algorithms?
- Procedures?
- Oracle?
- Machine Learning?

Automated Quality Assessment

Empirical Analysis + Visual Aid
- Field Usage
- Field Values

Framework
- Based on Information Quality
- Three Dimensions:
  - Intrinsic
  - Relational / Contextual
  - Reputational
- Evaluation Criteria
  - Completeness
  - Accuracy
  - Provenance
  - Logical Consistency
  - Timeliness
  - …

QUALITY METRICS

$q_m : \mathit{record}_t \longrightarrow V \in [0, 1]$

Measurement. Assigning a symbolic value to an object to enable the characterization of a certain attribute of that object.

Process P

Quality. Complex Attribute. No single measure. Highly Subjective. Use of Proxies.

Completeness. How many fields have been completed?

Record contains all the information required to have an ideal representation of the described resource.

Metadata Record

Name: uk-civil-service-high-earners
ID: 68addaac-59ae-4230-bb67-c5a8f6a76285
Maintainer: (empty)
Maintainer Email: (empty)
Author: Civil Service Capability Group
Author Email: webmaster@cabinet-office.x.gsi.gov.uk
License ID: uk-ogl

Resources
- Size: 40959
  Description: Civil Servants Salaries 2010
  Format: CSV
- Size: (empty)
  Description: Civil Servants Salaries 2011
  Format: CSV
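The transcript does not preserve the completeness formula. A minimal sketch, assuming the common definition (completed fields divided by total fields) and looking only at the top-level fields of the record above, might be written as follows. The function names and the rule for what counts as "completed" are illustrative assumptions, not the Metadata Census implementation.

```python
from typing import Any, Mapping


def is_completed(value: Any) -> bool:
    """Assumed rule: a field is completed if it is not None and not blank/empty."""
    if value is None:
        return False
    if isinstance(value, str):
        return value.strip() != ""
    if isinstance(value, (list, dict, tuple, set)):
        return len(value) > 0
    return True


def completeness(record: Mapping[str, Any]) -> float:
    """Fraction of fields in `record` that are completed, in [0, 1]."""
    if not record:
        return 0.0
    completed = sum(1 for value in record.values() if is_completed(value))
    return completed / len(record)


# Top-level fields of the example record (resources omitted for brevity).
record = {
    "name": "uk-civil-service-high-earners",
    "id": "68addaac-59ae-4230-bb67-c5a8f6a76285",
    "maintainer": "",
    "maintainer_email": "",
    "author": "Civil Service Capability Group",
    "author_email": "webmaster@cabinet-office.x.gsi.gov.uk",
    "license_id": "uk-ogl",
}
score = completeness(record)  # 5 of the 7 fields are filled
```

A fuller version would presumably also descend into the nested resource entries, where the second resource is missing its size.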

Weighted Completeness. Not all fields are equally relevant.

The weight value $w_i$ expresses the relative importance of field $i$.
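As a rough sketch, weighted completeness can be read as the sum of the weights of all completed fields divided by the sum of all weights. The formula itself is not preserved in the transcript, so this reading is an assumption, and the weights below are hypothetical values chosen only for illustration.

```python
from typing import Any, Mapping


def weighted_completeness(record: Mapping[str, Any],
                          weights: Mapping[str, float]) -> float:
    """Sum of weights of completed fields divided by the sum of all weights."""
    total = sum(weights.values())
    if total == 0:
        return 0.0
    achieved = 0.0
    for field, weight in weights.items():
        value = record.get(field)
        # Assumed rule: None and blank strings count as not completed.
        if value is not None and str(value).strip() != "":
            achieved += weight
    return achieved / total


# Hypothetical weights: name and license matter most, maintainer data least.
weights = {
    "name": 3.0,
    "license_id": 3.0,
    "author": 2.0,
    "maintainer": 1.0,
    "maintainer_email": 1.0,
}
record = {
    "name": "uk-civil-service-high-earners",
    "license_id": "uk-ogl",
    "author": "Civil Service Capability Group",
    "maintainer": "",
    "maintainer_email": "",
}
score = weighted_completeness(record, weights)  # (3 + 3 + 2) / 10
```

With these weights the two missing maintainer fields cost only 20% of the score, whereas plain completeness would penalize them as heavily as a missing name.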


Accuracy. How accurately is the resource represented?

Semantic distance. The difference between the information a user can extract from the record and the information in the resource itself.

Metadata Record

Name: regional-household-income
ID: 98899446-0a1a-43bc-874c-2d54dc700670
Maintainer: (empty)
Maintainer Email: (empty)
Author: Office for National Statistics
Author Email: (empty)
License ID: uk-ogl

Resources
- URL: http://www.ons.gov.uk/ons/rhi13
  Description: Spring 2013
  Format: CSV
- URL: http://www.ons.gov.uk/ons/rhi14
  Description: (empty)
  Format: CSV

[Figure labels: CSV, HTML]

Richness of Information. How much value is added?

$$q_i(\mathit{record}) = \frac{\sum_{i=1}^{n} I(\mathit{field}_i)}{n}$$

Vocabulary terms and descriptions should be meaningful. Information should be unique and not redundant.

[Figure: matrix with $m$ = Number of Documents and $n$ = Number of Words]
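The transcript does not define the information function $I$. The sketch below averages $I$ over the fields of a record, as in the formula above, and fills in $I$ with one possible choice: the mean normalized inverse document frequency of the words in a field, so that words occurring in every document contribute nothing. That choice, and all names in the code, are assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenization (deliberately simplistic)."""
    return text.lower().split()


def document_frequencies(documents: Sequence[str]) -> Dict[str, int]:
    """Number of documents in which each word occurs at least once."""
    df: Dict[str, int] = {}
    for document in documents:
        for word in set(tokenize(document)):
            df[word] = df.get(word, 0) + 1
    return df


def information(field_text: str, df: Dict[str, int], m: int) -> float:
    """Assumed I(field): mean of log(m / df(word)) / log(m), in [0, 1]."""
    words = tokenize(field_text)
    if not words or m <= 1:
        return 0.0
    # Unseen words are treated as if they occurred in a single document.
    scores = [math.log(m / df.get(word, 1)) / math.log(m) for word in words]
    return sum(scores) / len(scores)


def richness(fields: Sequence[str], df: Dict[str, int], m: int) -> float:
    """Average information over the n fields of a record."""
    if not fields:
        return 0.0
    return sum(information(field, df, m) for field in fields) / len(fields)


documents = ["household income data", "income data", "data"]
df = document_frequencies(documents)
m = len(documents)
score = richness(["household", "data"], df, m)  # (1.0 + 0.0) / 2
```

Here "data" appears in all three documents and carries no information, while "household" appears in only one and scores the maximum.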

Readability. How readable are the descriptions? Readable in terms of cognitive accessibility.

Flesch-Kincaid Reading Ease
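For reference, the Flesch Reading Ease score is 206.835 − 1.015 × (words / sentences) − 84.6 × (syllables / words). The sketch below implements that formula with a naive vowel-group syllable counter, which is only an approximation for English text. Mapping the result to [0, 1] by clamping to 0..100 and dividing by 100 is an assumption made to fit the metric range used in this deck.

```python
import re


def count_syllables(word: str) -> int:
    """Approximate syllables as the number of vowel groups (at least 1)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease; higher values mean easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(word) for word in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))


def readability(text: str) -> float:
    """Assumed normalization: clamp the score to 0..100, then scale to [0, 1]."""
    score = flesch_reading_ease(text)
    return min(max(score, 0.0), 100.0) / 100.0


ease = flesch_reading_ease("The cat sat.")  # 3 words, 1 sentence, 3 syllables
normalized = readability("The cat sat.")
```

Very short descriptions such as "Spring 2013" score as extremely easy under this formula, which is one of the special cases a production metric would need to handle.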

Availability. Are the links working?

Metadata only links to the resources. Without working links the actual data is not available.

The indicator for the $i$th resource is true if that resource is reachable through its URL.
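A minimal sketch of this metric, assuming availability is the fraction of a record's resource URLs that respond successfully, is shown below. The HEAD request, the timeout, and the function names are illustrative assumptions. The reachability check is passed in as a parameter so that it can be replaced by a stub.

```python
import urllib.error
import urllib.request
from typing import Callable, Sequence


def is_reachable(url: str, timeout: float = 10.0) -> bool:
    """True if a HEAD request to `url` returns a 2xx or 3xx status."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, ValueError, OSError):
        # HTTP errors, malformed URLs, timeouts and connection failures.
        return False


def availability(urls: Sequence[str],
                 check: Callable[[str], bool] = is_reachable) -> float:
    """Fraction of resource URLs that are reachable, in [0, 1]."""
    if not urls:
        return 0.0
    return sum(1 for url in urls if check(url)) / len(urls)


urls = ["http://www.ons.gov.uk/ons/rhi13", "http://www.ons.gov.uk/ons/rhi14"]
# Stub checker for illustration: pretend only the first link still works.
score = availability(urls, check=lambda url: url.endswith("rhi13"))
```

Some servers reject HEAD requests, so a real harvester might fall back to GET before declaring a link dead.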

Implementation.

Metadata Census

REQUIREMENTS

Functional
- Metadata Harvester
- Schemaless Data Store
- Quality Metrics
- Visualization
- Leaderboard

Non-functional
- Scalability
- Extensibility

[Class diagram]

Repository
+ url : String
+ name : String
+ type : Symbol

Snapshot
+ date : Date

MetaMetadata
+ metadata_record : Hash
+ score : Float

+ statistics : Hash
+ completeness : Hash
+ weighted_completeness : Hash
+ richness_of_information : Hash
...

+ latitude : String
+ longitude : String
+ best_record() : MetaMetadata
+ worst_record() : MetaMetadata
+ score() : Float

Multiplicities: 0..*, 1..*

DESIGN.

[Class diagram]

<<Interface>> Metric
+ compute(record)

CompletenessMetric, WeightedCompleteness, OpennessMetric

MetricWorker
+ perform(snapshot, metric)

GenericMetricWorker, CompletenessMetricWorker

<<use>>
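The design above could be sketched roughly as follows: a `Metric` interface with a single `compute(record)` method, and a worker that applies a metric to every record of a snapshot. The class and method names are taken from the diagram. Representing a snapshot as a plain list of records and the body of `CompletenessMetric` are assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any, List, Mapping, Sequence


class Metric(ABC):
    """Interface from the diagram: every metric maps a record to a score."""

    @abstractmethod
    def compute(self, record: Mapping[str, Any]) -> float:
        ...


class CompletenessMetric(Metric):
    """Assumed implementation: fraction of non-empty fields."""

    def compute(self, record: Mapping[str, Any]) -> float:
        if not record:
            return 0.0
        filled = sum(1 for v in record.values() if v not in (None, "", [], {}))
        return filled / len(record)


class MetricWorker:
    """Applies one metric to all records of a snapshot."""

    def perform(self, snapshot: Sequence[Mapping[str, Any]],
                metric: Metric) -> List[float]:
        return [metric.compute(record) for record in snapshot]


# A snapshot is modeled here as a list of metadata records (an assumption).
snapshot = [
    {"name": "regional-household-income", "maintainer": ""},
    {"name": "uk-civil-service-high-earners", "maintainer": "Cabinet Office"},
]
scores = MetricWorker().perform(snapshot, CompletenessMetric())
```

Because the worker depends only on the interface, a new metric can be added by writing one more subclass, which matches the extensibility requirement listed earlier.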

Metadata Census

[Architecture diagram, built up over several slides. Components: Metadata Harvester, Dump Importer, Preliminary Analyzer, Database, Metric Processor, Scheduler, Analyzer, View, User. Labels: API, Requests, Records, JSON, Archives, Imports, Persist, Query, Generates, Investigates.]

Open Government Data.

Evaluation

Implementation focused exclusively on CKAN repositories.

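| Rank | Repository | Score | Misspelling | Richness of Information | Openness | Completeness | Availability | Weighted Completeness | Readability | Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | data.gc.ca | 74 | 97 | 86 | 80 | 79 | 79 | 81 | 71 | 20 |
| 2 | data.sa.gov.au | 71 | 98 | 63 | 94 | 77 | 86 | 82 | 72 | 0 |
| 3 | GovData.de | 67 | 99 | 4 | 38 | 55 | 81 | 87 | 79 | 56 |
| 4 | data.qld.gov.au | 66 | 99 | 67 | 96 | 73 | 60 | 78 | 59 | 0 |
| 4 | PublicData.eu | 66 | 98 | 84 | 69 | 64 | 70 | 67 | 42 | 32 |
| 4 | data.gov.uk | 66 | 97 | 85 | 69 | 62 | 74 | 67 | 44 | 28 |
| 4 | africaopendata.org | 66 | 100 | 20 | 78 | 70 | 87 | 68 | 55 | 53 |
| 5 | datos.codeandomexico.org | 65 | 100 | 55 | 84 | 65 | 100 | 75 | 37 | 0 |
| 6 | catalogodatos.gub.uy | 63 | 100 | 64 | 1 | 70 | 74 | 78 | 65 | 52 |
| 6 | data.openpolice.ru | 63 | 100 | 0 | 0 | 58 | 100 | 81 | 100 | 64 |
| 7 | dados.gov.br | 61 | 100 | 87 | 36 | 53 | 57 | 72 | 44 | 39 |
| 8 | opendata.admin.ch | 59 | 100 | 12 | 0 | 58 | 100 | 68 | 35 | 100 |
| 9 | data.gv.at | 57 | 100 | 21 | 99 | 51 | 68 | 65 | 59 | 0 |
| 10 | data.gov.sk | 49 | 100 | 51 | 0 | 48 | 92 | 58 | 37 | 7 |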

Conclusion

What is good about this approach?

Metadata quality is quantified, with every quality aspect measured on its own. Metric scores are aggregated to make repositories comparable.
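The transcript does not state the aggregation rule. As a sketch, assuming the overall score is the unweighted mean of the individual metric scores, the data.gc.ca row of the evaluation table can be reproduced as follows. The assignment of values to metric names assumes the table columns appear in the same order as the header labels.

```python
from typing import Mapping


def aggregate(metric_scores: Mapping[str, float]) -> float:
    """Assumed aggregation: unweighted mean of all metric scores."""
    if not metric_scores:
        return 0.0
    return sum(metric_scores.values()) / len(metric_scores)


# Metric scores for data.gc.ca, taken from the evaluation table.
data_gc_ca = {
    "misspelling": 97,
    "richness_of_information": 86,
    "openness": 80,
    "completeness": 79,
    "availability": 79,
    "weighted_completeness": 81,
    "readability": 71,
    "accuracy": 20,
}
score = aggregate(data_gc_ca)  # 593 / 8 = 74.125, shown as 74 in the table
```

An unweighted mean treats a near-universal 100 for misspelling the same as a hard-won accuracy score, which is one reason a weighted aggregation might be preferable.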

Every additional quality metric is supposed to complete the quality puzzle.

Automated — Generic — Quantifiable — Repeatable

The platform has the advantage that it acts as a beacon...

If your metadata breaks bad, everyone will see it.

What is not so good about this approach?

- Lacks a sufficient number of quality metrics
- No empirical analysis beforehand
- Overvalues problems with the metadata

More quality metrics are necessary. Current metrics need to consider more special cases in the metadata records.

Final Thought. Do not aim for excellence; aim to avoid low-quality metadata.

Quality Feed. Monitor metadata changes live and record changes in a timeline.

Repository Support. There are more repository platforms with public APIs, Socrata being the most prominent.

More Quality Metrics
- Duplicate Detection
- Discoverability
- Coherence
- Advancement
- Reputation

Metadata Revision System. Avoid storing whole snapshots; store only the change set.
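One way to picture a change set, as a rough sketch rather than a proposed design, is a per-field diff between two versions of a record that can be replayed on the older version. The representation as `(before, after)` tuples and the function names are assumptions. For simplicity, a value of `None` is treated the same as a missing field.

```python
from typing import Any, Dict, Mapping, Tuple

ChangeSet = Dict[str, Tuple[Any, Any]]


def change_set(old: Mapping[str, Any], new: Mapping[str, Any]) -> ChangeSet:
    """Fields whose value differs between `old` and `new`, as (before, after)."""
    changes: ChangeSet = {}
    for key in old.keys() | new.keys():
        before, after = old.get(key), new.get(key)
        if before != after:
            changes[key] = (before, after)
    return changes


def apply_changes(old: Mapping[str, Any], changes: ChangeSet) -> Dict[str, Any]:
    """Rebuild the newer record from the older one plus the change set."""
    result = dict(old)
    for key, (_, after) in changes.items():
        if after is None:
            result.pop(key, None)  # field was removed
        else:
            result[key] = after
    return result


old = {"name": "uk-civil-service-high-earners", "maintainer": "M", "format": "CSV"}
new = {"name": "uk-civil-service-high-earners", "maintainer": "", "format": "CSV",
       "size": 40959}
changes = change_set(old, new)
restored = apply_changes(old, changes)
```

Only the two changed fields are stored here, instead of a full second copy of the record.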

Domain-Specific Language. Make it even easier to add individual quality metrics.

DEMO: metadata-census.com