Rubbish in Rubbish out: applying good data governance techniques to gain maximum benefit from...

UKSG Conference April 2013 Phil Nicolson


By Phil Nicolson. Presented at UKSG, April 2013.


Page 1: Rubbish in Rubbish out: applying good data governance techniques to gain maximum benefit from publisher data

UKSG Conference April 2013

Phil Nicolson

Page 2

Data Governance

What is Data Governance?

What is Data Quality?

The challenges

Data governance programme

A publisher approach

The outcome: Book author example

ICEDIS

Summary

Page 3

Data governance

“I think that the key issue here, is that the information is probably incorrect, inaccurate and in a form that almost certainly shouldn't have been used”

Dr John Thomson, cardiologist at Leeds General Infirmary, Sky News, 30/3/2013

Page 4

Data Governance – a definition

Data governance is defined as the processes, policies, standards, organisation, and technologies required to manage and ensure the availability, accessibility, quality, consistency, auditability, and security of data

Page 5

Data Quality - definitions

Data are of high quality "if they are fit for their intended uses in operations, decision making and planning"

Data are deemed of high quality if they correctly represent the real-world construct to which they refer

Page 6

Data Quality

Data quality attributes:

Accurate

Reliable

Complete

Appropriate

Timely

Credible

Up-to-date

Page 7

The challenge: Data Sources

Multiple data sources – ‘system’ data silos

Multiple locations – ‘geographic’ data silos

Data entered through multiple channels

Data entered by different people

Page 8

The challenge: Data Sources

Typical publisher systems:

Financial system

CRM/Sales database

Authentication system

Fulfilment

Usage statistics

Submissions system

Author database

…..

Data can be entered by:

Organisation staff

Authors

Society members

Agents in the supply chain

3rd party organisations

…..

Page 9

The challenge: Institutions

UCL:

University College London (UK)

Université Catholique de Louvain (Belgium)

Universidad Cristiana Latinoamericana (Ecuador)

University College Lillebælt (Denmark)

Centro Universitario Celso Lisboa (Brazil)

Union County Library (USA)

NPL:

National Physical Laboratory (UK)

National Physical Laboratory (India)

York Uni.:

University of York (UK)

York University (Canada)

Northeastern University:

Northeastern University (Boston, USA)

Northeastern University (Shenyang, China)
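The ambiguity on this slide can be illustrated with a minimal Python sketch. It assumes a hypothetical lookup table keyed by abbreviation. The institution names come from the slide, but the `INST-…` identifiers and the function names are invented placeholders, not real Ringgold or ISNI values.

```python
# Candidate institutions per abbreviation (names taken from the slide).
# The "INST-..." identifiers are invented placeholders for illustration only.
CANDIDATES = {
    "UCL": [
        ("INST-001", "University College London", "UK"),
        ("INST-002", "Université Catholique de Louvain", "Belgium"),
        ("INST-003", "Union County Library", "USA"),
    ],
    "NPL": [
        ("INST-004", "National Physical Laboratory", "UK"),
        ("INST-005", "National Physical Laboratory", "India"),
    ],
}


def resolve(abbreviation, country=None):
    """Return the candidates for an abbreviation, optionally narrowed by country."""
    matches = CANDIDATES.get(abbreviation.upper(), [])
    if country is not None:
        matches = [m for m in matches if m[2] == country]
    return matches


def is_ambiguous(abbreviation, country=None):
    """A name is ambiguous when it maps to more than one institution."""
    return len(resolve(abbreviation, country)) > 1
```

Here `is_ambiguous("UCL")` is true until extra context such as the country narrows the match to a single record, which is the point at which a unique identifier can safely be attached.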

Page 10

The challenge: Individuals

How can we uniquely identify individuals?

Of the 700,000 individuals known to the RSC in 2012 there were:

Smith: ~1,500

Jones: ~1,000

Li: >10,000

Page 11

Consequences of poor data

Page 12

Biggest obstacle(s) to data quality improvement in your organization?

Lack of accountability and responsibility for data quality 55.4%

Too many information silos 51.8%

Lack of awareness or communication of the magnitude of data quality problems 51.4%

Lack of common understanding of what data quality means 50.2%

Lack of awareness or communication of the opportunities associated with high quality data 45.0%

Lack of senior leadership in tackling data quality issues 44.2%

Lack of data quality policies, plans, and procedures 42.2%

Perception that data quality is an IT issue only rather than an organisation wide issue 41.8%

The State of Information and Data Quality 2012 Industry Survey & Report (IAIDQ): Understanding how Organizations Manage the Quality of their Information and Data Assets. Pierce, Yonke, Malik, Nagaraj

Page 13

Data Governance – why it is vital

“processes, policies, standards… ensure quality and consistency”

Increase consistency and confidence in our decision making

Maximise the income generation potential of our data

Provide excellent customer service

Designate accountability for information quality

Minimise or eliminate re-work

Optimise staff effectiveness

Decrease the risk of regulatory fines

Improve data security

Data is one of the most valuable assets within an organisation

Page 14

Data governance – a new culture

Page 15

Data governance programme

Page 16

Plan & prioritise

Sponsorship: director level sponsor?

Program management: business or IT driven?

Organisational structure: local, national, international?

Scope: focus on the most important data?

Ownership: who are the business owners of critical data?

New system implementation: protect investment

Page 17

Plan & prioritise

Resources: dedicated staff?

Funding: which area of the business will fund the program?

Business drivers: what are the major business drivers?

Barriers: what are the main barriers (cultural, funding, resources, priorities etc.) and can they be mitigated?

Page 18

Audit & Analyse

Audit existing data quality

Review all relevant systems

How poor is it?

Incomplete data

Invalid

Out of date

….
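The three problem types listed above (incomplete, invalid, out of date) can be counted with a small audit script. This is only a sketch: the field names, the crude email test and the two-year staleness threshold are all assumptions chosen for illustration, not part of the original programme.

```python
from datetime import date

# Hypothetical required fields; a real audit would take these from the data owner.
REQUIRED_FIELDS = ("name", "email", "institution")


def audit_record(record, today, max_age_days=730):
    """Return a list of problem labels found in one contact record."""
    problems = []
    # Incomplete: a required field is missing or blank.
    for field in REQUIRED_FIELDS:
        if not (record.get(field) or "").strip():
            problems.append(f"incomplete:{field}")
    # Invalid: a deliberately crude check that an email contains a usable "@".
    email = (record.get("email") or "").strip()
    if email and ("@" not in email or email.startswith("@") or email.endswith("@")):
        problems.append("invalid:email")
    # Out of date: not updated within the assumed threshold.
    last_updated = record.get("last_updated")
    if last_updated is not None and (today - last_updated).days > max_age_days:
        problems.append("out_of_date")
    return problems


def audit(records, today):
    """Count how often each problem label occurs across all records."""
    counts = {}
    for record in records:
        for problem in audit_record(record, today):
            counts[problem] = counts.get(problem, 0) + 1
    return counts
```

The counts returned by `audit` give a first answer to "How poor is it?" and a baseline against which later clean-up can be measured.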

Page 19

Clean existing data

Prioritise

Quick wins

Highlight progress

What can be automated?

Introduce unique identifiers

Page 20

Identifiers available

People

International Standard Name Identifier (ISNI)

Open Researcher and Contributor ID (ORCID)

Scopus Author Identifier

ResearcherID

Organisations

International Standard Name Identifier (ISNI)

Ringgold ID

DUNS Number (D&B) and other business and finance IDs

MDR PID Numbers and other marketing IDs

Library of Congress MARC Code List for Organizations

Page 21

ISNI

[Diagram: Party ID 1 and Party ID 2, each with its own proprietary information and/or metadata, each linked to an ISNI number]

ISNI is designed to be a “bridge identifier”
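A minimal sketch of the bridge idea: each party keeps its own proprietary ID but records the shared ISNI alongside it, so the two ID schemes can be mapped without either side exposing its internal data. Every identifier below is an invented placeholder, and `build_crosswalk` is a hypothetical helper name.

```python
# Each party maps its own proprietary ID to the shared ISNI it has recorded.
# All values are invented placeholders, not real identifiers.
PARTY_1 = {"P1-17": "ISNI-PLACEHOLDER-A", "P1-42": "ISNI-PLACEHOLDER-B"}
PARTY_2 = {"X-900": "ISNI-PLACEHOLDER-B", "X-901": "ISNI-PLACEHOLDER-C"}


def build_crosswalk(party_1, party_2):
    """Map party 1 IDs to party 2 IDs wherever both point at the same ISNI."""
    isni_to_party_2 = {isni: party_id for party_id, isni in party_2.items()}
    return {
        party_id: isni_to_party_2[isni]
        for party_id, isni in party_1.items()
        if isni in isni_to_party_2
    }
```

Only records that both parties have tied to the same ISNI appear in the crosswalk; everything else stays private to its owner.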

Page 22

Author IDs

ORCID is designed to persistently identify and disambiguate scholarly researchers and attach them to research output

ORCID identifiers utilize a format compliant with the ISNI ISO standard

ISNI has reserved a block of identifiers for use by ORCID, so there will be no overlaps in assignments

Recorded as http://orcid.org/0000-0001-2345-6789

http://about.orcid.org/

http://www.isni.org/
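Because ORCID iDs follow the ISNI format, the final character is a check character computed with ISO 7064 MOD 11-2, which makes a basic validity test possible at the point of data capture. The sketch below is illustrative only; the function names are assumptions, and it checks format and checksum, not whether the iD is actually registered.

```python
import re

# Four groups of four, the last character being a digit or "X".
ORCID_PATTERN = re.compile(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]")


def orcid_check_character(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(value):
    """Accept a bare iD or the URL form shown on the slide."""
    # When a URL is given, the iD is the final path segment.
    orcid = value.strip().rsplit("/", 1)[-1]
    if not ORCID_PATTERN.fullmatch(orcid):
        return False
    digits = orcid.replace("-", "")
    return orcid_check_character(digits[:15]) == digits[15]
```

The example iD on the slide, `0000-0001-2345-6789`, passes this check, while changing its last digit makes it fail.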

Page 23

Use cases

Disambiguation of researchers and connection to all their research

Links to contributors, editors, compilers and others involved in the research process

Embed IDs into research workflows and the supply chain

Integrate systems

Page 24

Institutional IDs

Ringgold is an ISNI Registration Agency

Unique institutional ID number maps data across systems

ISNI numbers should be used across the scholarly supply chain to:

Disambiguate institutional records

Eradicate duplication of data

Map institutions into their hierarchy

Link systems using the institutional ID as the lynchpin
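Mapping institutions into their hierarchy can be pictured as following parent links from a department up to its top-level institution. The sketch below assumes a simple child-to-parent table; the IDs and the three-level structure are invented for illustration and do not reflect how any real registry stores its hierarchy.

```python
# child ID -> parent ID; None marks a top-level institution.
# IDs and structure are invented placeholders.
PARENT = {
    "DEPT-CHEM": "FACULTY-SCI",
    "FACULTY-SCI": "UNIV-1",
    "UNIV-1": None,
}


def hierarchy_path(institution_id, parent=PARENT):
    """Return the chain from an institution up to its top-level parent."""
    path = []
    seen = set()
    current = institution_id
    while current is not None:
        if current in seen:
            # Guard against bad data that makes an institution its own ancestor.
            raise ValueError(f"cycle in hierarchy at {current}")
        seen.add(current)
        path.append(current)
        current = parent.get(current)
    return path


def top_level(institution_id, parent=PARENT):
    """Return the top-level institution that a record ultimately rolls up to."""
    return hierarchy_path(institution_id, parent)[-1]
```

Rolling records up with `top_level` lets activity recorded against a department be reported against its parent institution as well.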

Page 25

Minimising the impact of data silos

Standard identifiers (both individual and institution) can be used to break down silos by enabling better system linking.
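As a rough illustration of system linking, the sketch below merges records from two hypothetical silos (a CRM and a usage-statistics system) into one view per institution, using a shared institutional identifier as the join key. The silo contents, field names and IDs are all invented.

```python
from collections import defaultdict

# Records from two hypothetical silos that share an institutional identifier.
CRM = [
    {"institution_id": "INST-001", "account_manager": "Sales team A"},
    {"institution_id": "INST-002", "account_manager": "Sales team B"},
]
USAGE = [
    {"institution_id": "INST-001", "downloads": 1200},
    {"institution_id": "INST-003", "downloads": 75},
]


def link_by_identifier(**silos):
    """Merge records from each named silo into one view per institution ID."""
    linked = defaultdict(dict)
    for silo_name, records in silos.items():
        for record in records:
            key = record["institution_id"]
            fields = {k: v for k, v in record.items() if k != "institution_id"}
            linked[key][silo_name] = fields
    return dict(linked)
```

Calling `link_by_identifier(crm=CRM, usage=USAGE)` shows at a glance which institutions appear in both systems and which are known to only one, something that free-text institution names cannot reliably provide.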

Page 26

Improve data capture

Data quality policy

Web forms

Closer collaboration with 3rd parties to encourage use of industry standard identifiers such as ISNI or ORCID

Page 27

Data capture - data quality policy

Design to ensure accuracy, quality and consistency

Individual responsibilities:

All staff are responsible for the accuracy and consistency of data

Capture data in such a way that it is uniquely identifiable and easily shared within the organisation and with 3rd parties

Records relating to individuals

Records relating to institutions

Reporting of inaccuracies to Data Owners

Data owners' responsibilities:

All source data systems must have a designated Data Owner

Data owner retains overall responsibility for all records within their source data system

Page 28

Improve data capture – web forms

Required fields

Validation

Address validation – postcode lookup

Institution validation – institution lookup

‘Internal’ and ‘external’ web form consistency

Language barriers

Help and hints

Free-text fields
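Several of the points above (required fields, validation, institution lookup instead of free text) can be combined in one server-side check. This is a sketch under stated assumptions: the field names, error messages and the deliberately simple email pattern are illustrative, and `known_institutions` stands in for whatever institution lookup a real form would use.

```python
import re

# Hypothetical required fields for a registration form.
REQUIRED_FIELDS = ("name", "email", "institution_id")
# Deliberately simple pattern; real email validation is more involved.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def validate_submission(form, known_institutions):
    """Return a dict of field -> error message; an empty dict means accepted."""
    errors = {}
    # Required fields must be present and non-blank.
    for field in REQUIRED_FIELDS:
        if not form.get(field, "").strip():
            errors[field] = "This field is required."
    # Validation: reject obviously malformed email addresses.
    email = form.get("email", "").strip()
    if email and not EMAIL_PATTERN.fullmatch(email):
        errors["email"] = "Please enter a valid email address."
    # Institution lookup: accept only IDs chosen from the lookup, not free text.
    institution_id = form.get("institution_id", "").strip()
    if institution_id and institution_id not in known_institutions:
        errors["institution_id"] = "Please choose an institution from the lookup."
    return errors
```

Applying the same function to both internal and external forms is one way to keep their behaviour consistent.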

Page 29

On-going monitoring

Dashboards

Regular audits

Metrics – Institutional Linking Rate

Staff awareness

Reporting of errors
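The slide names the Institutional Linking Rate metric without defining it. One plausible reading, assumed here, is the percentage of records in a data source that carry an institutional identifier; the field name `institution_id` is likewise an assumption.

```python
def institutional_linking_rate(records):
    """Percentage of records that carry a non-empty institution identifier."""
    if not records:
        return 0.0
    linked = sum(1 for record in records if record.get("institution_id"))
    return 100.0 * linked / len(records)
```

Tracked per data source and recomputed at each audit, a figure like this makes progress (or regression) visible on a dashboard.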

Page 30

A publisher example

Develop a Data Governance Programme

Data ‘champion’

Engagement – at all levels

Ownership – at all levels

Allocate necessary resources

Guidelines/Policy - Data quality policy

Processes put in place

Education - raise awareness

New staff – training on Data Governance and its wider impact

Change of culture

Page 31

A publisher example

Ringgold and DataSalon client

All institutional records contain Ringgold Identifiers

System linking via Individual and Institutional identifiers

Data (both good and bad) visible to all via MasterVision

Use of data governance dashboards

Tidying of existing data

Simple reporting of incorrect data across organisation

New data captured correctly

Page 32

Author database

1. Create a data governance dashboard to monitor problem areas:

• Book authors with no related institution
• Unknown book authors
• Author records without an affiliation entry
• Author records with commas in the affiliation entry
• Book authors without an email address
• Book authors with an invalid email address

2. Correct problem records in existing data

• Dashboard clearly highlighted all records of concern and these records were corrected
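The dashboard checks listed in step 1 can be expressed as simple per-record rules. The sketch below is not the MasterVision dashboard itself; the record layout, field names and email pattern are assumptions, and the "unknown book authors" check is left out because the slide does not say how such records are recognised.

```python
import re

# Deliberately simple pattern; real email validation is more involved.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def has_invalid_email(author):
    """True when an email is present but does not look like an address."""
    email = author.get("email", "").strip()
    return bool(email) and not EMAIL_PATTERN.fullmatch(email)


# Dashboard row label -> rule that flags a problem record.
CHECKS = {
    "no related institution": lambda a: not a.get("institution_id"),
    "no affiliation entry": lambda a: not a.get("affiliation", "").strip(),
    "comma in affiliation entry": lambda a: "," in a.get("affiliation", ""),
    "no email address": lambda a: not a.get("email", "").strip(),
    "invalid email address": has_invalid_email,
}


def dashboard(authors):
    """Return, for each check, the IDs of the author records that fail it."""
    return {
        label: [a["author_id"] for a in authors if check(a)]
        for label, check in CHECKS.items()
    }
```

Listing the failing record IDs, not just counts, is what lets step 2 (correcting the records of concern) follow directly from the dashboard.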

Page 33

Author database

3. Ensure new records are created correctly

• Raise staff understanding of the importance of capturing data correctly and the impact it has across the organisation as a whole (data silos)
• Training covering data governance

4. Ensure appropriate Ringgold coverage

• Where institutions were discovered in the Author database that didn’t exist within Identify, these were reported to Ringgold. This means not only that individual authors can be linked to the new institution but also that any individuals in other data sources at the same institution can be linked. This benefits all users of our data and potentially highlights new sales opportunities.

5. Monitor data quality on an on-going basis

• Books data governance dashboard updated on a weekly basis.

Page 34

Author database – results

[Chart: percentage axis from 70.00% to 100.00%; labels “All data sources” and “ANKO”]

10% will never link:

• Missing data (old records)
• Institution no longer exists
• Retired author
• Genuinely no related institution

End of process:

• 15% increase in authors linked to institutions - information valuable in supporting all areas of the business
• Ready for data migration

Page 35

ICEDIS

The international standards organization EDItEUR is working to encourage improvements in the ways that "party" information is communicated

Some parts of the supply chain continue to send unstructured name & address records, making matching, disambiguation and automatic ingest near impossible

ICEDIS has collaborated with EDItEUR to develop a highly structured data model for exchanging names, addresses and standard identifiers.

The group has recently been validating the model by means of a "paper pilot", using a small library of about 100 name & address types

An XML schema and HTML documentation are freely available

www.editeur.org www.editeur.org/138/[email protected]

Page 36

Summary

Your data is a very valuable asset when managed correctly

Establishing a data governance programme will enable you to gain maximum benefit from that data

Data governance is as much about changing the culture of an organisation as it is about processes and procedures

It will take time but the benefits can be enormous

Page 37

Phil Nicolson

Data Manager

Ringgold Inc.

[email protected]