Towards a Community-driven Data Science Body of Knowledge – Data Management Skills and Competences
FAIR 2016, Florence
14-15 November 2016
Andrea Manieri
Engineering Ingegneria Informatica S.p.A.
EDISON – Education for Data Intensive
Science to Open New science frontiers
Grant 675419 (INFRASUPP-4-2015: CSA)
Credits:
• Yuri Demchenko (UvA)
• Steve Brewer (SOTON)
• Kim Hee (GOETHE)
• Adam Belloum (UvA)
• Spiros Koulouzis (UvA)
A sense of urgency – dated 2013
“Europe faces up to 700.000 unfilled ICT jobs and declining competitiveness. The number of
digital jobs is growing – by 3% each year during the crisis – but the number of new ICT
graduates and other skilled ICT workers is shrinking. Our youth need actions not words, and
companies operating in Europe need the right people or they will move operations
elsewhere”. EC press release, 25 Jan 2013
The Grand Coalition for Digital Jobs and the EU eSkills strategy for 2020 are becoming the Digital Skills and
Jobs Coalition (launch conference 1 Dec 2016 in Brussels)
Data Scientist shortage:
- Gartner, 2012
- McKinsey, 2013
- Forbes, 2013 https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century
Who needs data skills?
• As a student – I need recommendations on data-driven careers
• As a researcher – I need to cover gaps with respect to eScience
• As a librarian – I want to promote my competences
• As an employee – I want to reskill for data-driven jobs
Who needs data skills?
• As a scholar/lecturer
– I need to update my background
• As a training manager
– I want to innovate my offering
• As a course designer
– I have to define the right topics and know-how to be taught
Who needs data skills?
• As an HR manager
– I want to find fit-for-purpose candidates
• As a team leader
– I need to cover the know-how and skills for a task/project
• As an employer
– I want to define re-skilling plans for my workforce
Visionaries and Drivers
The Fourth Paradigm: Data-Intensive Scientific Discovery.
By Jim Gray, Microsoft, 2009. Edited by Tony Hey, et al.
http://research.microsoft.com/en-us/collaboration/fourthparadigm/
Riding the wave: How Europe can gain from the rising tide of scientific data.
Final report of the High Level Expert Group on Scientific Data. October 2010.
http://cordis.europa.eu/fp7/ict/e-infrastructure/docs/hlg-sdi-report.pdf
The Data Harvest: How sharing research data can yield knowledge, jobs and growth.
An RDA Europe Report. December 2014
https://rd-alliance.org/data-harvest-report-sharing-data-knowledge-jobs-and-growth.html
https://www.rd-alliance.org/
NIST Big Data Working Group (NBD-WG)
https://www.rd-alliance.org/ (since 2013)
ISO/IEC JTC1 Big Data Study Group (SGBD)
http://jtc1bigdatasg.nist.gov/home.php (2014)
EDISON & RDA
• 1st RDA Plenary meeting – 18-20 March 2013
– 1st BoF on Education and Skills Development in Data Intensive Science
– Attended by 16 representatives from universities, libraries, e-Science, data
centers, research coordination bodies
• 3rd RDA Plenary meeting – 26-28 March 2014, Dublin
– 3rd BoF on Education and Skills Development in Data Intensive Science
– EDISON (Education for Data Intensive Science to Open New science
frontiers) Initiative announced
• 4th RDA Plenary meeting – 22-24 September 2014, Amsterdam
– IG Education and Training on Handling of Research Data (ETHRD)
established
– EDISON Workshop – 21 Sept 2014, Science Park Amsterdam
– Decision to form a consortium and submit a proposal to the INFRASUPP-4-2015 call
• 8th RDA Plenary meeting – 15-17 September 2016, Denver, USA
– BoFs and IG meetings – now developing Certification and Accreditation proposal
EDISON Data Science Framework (EDSF)
[Figure: EDSF diagram. Components: CF-DS, DS-BoK, MC-DS, DS Prof Profiles, Taxonomy and Vocabulary, eLearning Platform (Datasciencepro.eu), Community Portal (CP), Professional certification, Data Science career & professional development, Roadmap & Sustainability. Column labels: Foundation & Concepts, Services, Biz Model.]
• EDISON Framework components
– CF-DS – Data Science Competence Framework
– DS-BoK – Data Science Body of Knowledge
– MC-DS – Data Science Model Curriculum
– DSP - Data Science Professional profiles definition
– Data Science Taxonomies and Scientific Disciplines Classification
• Based on the definition by NIST Big Data WG (NIST SP1500 - 2015)
• A Data Scientist is a practitioner who has sufficient knowledge in
the overlapping regimes of expertise in business needs, domain
knowledge, analytical skills, and programming and systems
engineering expertise to manage the end-to-end scientific method
process through each stage in the big data lifecycle
– … till the delivery of expected scientific and business value to science or industry
• Other definitions add such features as
– Ability to solve a variety of business problems, tell “stories”, provide input to
decision making
– Optimize performance and suggest new services for the organisation
– Develop a special mindset and be statistically minded, understand raw
data and “appreciate data as a first class product”
Data Scientist Definition
• Data science is the empirical synthesis of actionable knowledge and technologies required to
handle data from raw data through the complete data lifecycle process.
• Big Data is the technology for building systems and infrastructures that process large volumes of
structurally complex data in a time-effective way
[ref] Legacy: NIST BDWG
definition of Data Science
• Commonly accepted Data Science competences/skills groups include
– Data Analytics or Business Analytics or Machine Learning
– Engineering or Programming
– Subject/Scientific Domain Knowledge
• EDISON identified 2 additional competence groups demanded by
organisations
– Data Management, Curation, Preservation
– Scientific or Research Methods and/vs Business
Processes/Operations
• Other commonly recognized skills, also known as “soft skills” or “social/professional
intelligence”
– Inter-personal skills or team work, cooperativeness
• Important aspects of integrating Data Scientists into the organisational structure
– General Data Science (and Data) literacy for all involved roles and management
– Commonly agreed and understandable way of communicating and
presenting information/data
– Role of the Data Scientist: provide data literacy advice and guidance to the
organisation
Data Science Competence Groups
• Group 1: Skills/experience related to
competences
– Data Analytics and Machine Learning
– Data Management/Curation (both general
and scientific)
– Data Science Engineering (hardware and
software) skills
– Scientific/Research Methods or Business
Process Management
– Application/subject domain related (research
or business)
– Mathematics and Statistics
• Group 2: Big Data (Data Science) tools
and platforms
– Big Data Analytics platforms
– Mathematics & Statistics applications & tools
– Databases (SQL and NoSQL)
– Data Management and Curation platform
– Data and applications visualisation
– Cloud based platforms and tools
Data Science Skills/Experiences
• Group 3: Programming, programming languages and IDEs
– General and specialized development platforms for data analysis and statistics
• Group 4: Soft skills or Social Intelligence
– Personal and inter-personal communication, team work, professional network
Comparing with relevant BoK
• ACM Computer Science Body of Knowledge (ACM CS-BoK)
• ICT professional Body of Knowledge (ICT-BoK)
• Business Analysis Body of Knowledge (BABOK)
• Software Engineering Body of Knowledge (SWEBOK)
• Data Management Body of Knowledge (DAMA-BoK) by Data Management Association International (DAMAI)
• Project Management Professional Body of Knowledge (PM-BoK)
• DS-BoK Knowledge Area Groups (KAG)
– KAG1-DSA: Data Analytics group including Machine Learning, statistical methods, and Business Analytics
– KAG2-DSE: Data Science Engineering group including Software and infrastructure engineering
– KAG3-DSDM: Data Management group including data curation, preservation and data infrastructure
– KAG4-DSRM: Scientific/Research Methods group
– KAG5-DSBP: Business process management group
• Data Science domain knowledge to be defined by related expert groups
Data Science BoK (DS-BoK)
Process Groups – knowledge at work
• Data Identification and Creation:
– how to obtain digital information from in-silico experiments and instrumentation, and how to collect and store it in digital form,
with the techniques, models, standards and tools needed to perform these activities, depending on the specific discipline.
• Data Access and Retrieval:
– tools, techniques and standards used to access any type of data from any type of media and retrieve it in compliance with
IPRs and established legislation.
• Data Curation and Preservation:
– activities related to data cleansing, normalisation, validation and storage.
• Data Fusion (or Data Integration):
– the integration of multiple data and knowledge representing the same real-world object into a consistent, accurate, and
useful representation.
• Data Organisation and Management:
– how to organise the storage of data for the various purposes required by each discipline; the tools, techniques, standards and
best practices (including IPR management, compliance with laws and regulations, and metadata definition and
completion) needed to set up ICT solutions that achieve the required Service Level Agreement for data conservation.
• Data Storage and Stewardship:
– how to enhance the use of data through metadata and other techniques, establishing long-term access and extended
use of that data, including by scientists and researchers from other disciplines and long after the data was produced.
• Data Processing:
– tools, techniques and standards to analyse different and heterogeneous data coming from various sources and different
scientific domains, in a variety of sizes (up to exabytes); it includes the notion of programming paradigms.
• Data Visualisation and Communication:
– techniques, models and best practices to merge and join various data sets, and techniques and tools for data analytics and
visualisation, depending on the significance of the data and the discipline.
Data Science Data Management Group (DSDM)
KAG3-DSDM: Data Management group including data curation, preservation and data infrastructure
• DAMA-BoK selected KAs
– (1) Data Governance
– (2) Data Architecture
– (3) Data Modelling and Design
– (4) Data Storage and Operations
– (5) Data Security
– (6) Data Integration and Interoperability
– (7) Documents and Content
– (8) Reference and Master Data
– (9) Data Warehousing and Business Intelligence
– (10) Metadata
– (11) Data Quality
• General Data Management KAs
– Data Lifecycle Management
– Data archives/storage compliance and certification
• New KAs to support RDA recommendations and community data management models (Open Access, Open Data, etc.)
– Data type registries, PIDs
– Data infrastructure and Data Factories
– …
• Professional profile groups are defined in compliance with the ESCO taxonomy
Data Science Professions Family
• Relevance of a competence to a DSP profile: 5 – high, 1 – low
Mapping DS-BoK GAs to DSP profiles
E-CO-2 Classification
• Text Filtering
• Find overlapping terms
• Calculate TF-IDF of terms
• For each category vector calculate cosine similarity
• The output is a CSV with the similarity for each
category
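The steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the EDISON tool's actual implementation: the stop-word list, the category texts, the sample document and all function names are hypothetical, "overlapping terms" is read here as the terms shared by the document and a category, and the smoothed IDF formula is an assumption.

```python
import csv
import io
import math
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "with", "on"}


def filter_text(text):
    """Text filtering: lowercase, keep alphabetic tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tfidf_vectors(token_lists):
    """Return one {term: TF-IDF weight} dict per token list.

    Uses a smoothed IDF, ln((1 + N) / (1 + df)) + 1 (an assumed variant).
    """
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            term: (count / total) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; only overlapping terms contribute."""
    shared = set(u) & set(v)
    dot = sum(u[t] * v[t] for t in shared)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify(document, categories):
    """Score a document (e.g. a CV or job ad) against each category text."""
    names = list(categories)
    token_lists = [filter_text(document)] + [filter_text(categories[n]) for n in names]
    doc_vec, *cat_vecs = tfidf_vectors(token_lists)
    return {name: cosine(doc_vec, vec) for name, vec in zip(names, cat_vecs)}


def to_csv(scores):
    """Render the per-category similarities as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["category", "similarity"])
    for name, score in scores.items():
        writer.writerow([name, f"{score:.4f}"])
    return buffer.getvalue()


# Hypothetical category descriptions keyed by competence group code.
CATEGORIES = {
    "DSDA": "machine learning statistical methods predictive analytics",
    "DSDM": "data curation preservation metadata management",
}

scores = classify("Curate and preserve research data with rich metadata", CATEGORIES)
print(to_csv(scores))
```

In this toy input the document shares the terms "data" and "metadata" with the DSDM description and nothing with DSDA, so DSDM scores above zero and DSDA scores exactly zero. A real run would need category vectors built from the full competence descriptions.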
[Figure: two charts, “Education offered vs. Market requests” and “CV vs. Job offering”, each plotted over the competence groups DSDA (Data Science Analytics), DSDK (DS Domain Knowledge), DSEN (Data Science Engineering), DSRM (Scientific/Research Methods) and DSDM (Data Management).]
• Data Science Model Curriculum includes
– Learning Outcomes (LO) definition based on CF-DS
• LOs are defined for CF-DS competence groups and for all enumerated competences
– LOs mapping to Learning Units (LU)
• LUs are based on CCS(2012) and universities best practices
• Data Science university programmes and courses inventory (interactive) http://edison-project.eu/university-programs-list
– LU/course relevance: Mandatory Tier 1, Tier 2, Elective, Prerequisite
– Learning methods and learning models (in progress)
• Based on Bloom’s Taxonomy, Outcome Based Learning, etc.
Data Science Model Curriculum (MC-DS)
Some numbers (2015)
• A portfolio of more than 300 courses
• 200 trainers and experts
• 5 offices and 16 classrooms
• 18.000 training person/hours
• New on-line platform
• Offices: Aosta, Roma, Padova, Milano, Frosinone
Engineering IT & Management school
Accreditation and Certification - RDA BoF
Aim: contribute to the sustainable development of the data
science profession.
Goal: deliver a report that presents a concise but
representative picture of the various accreditation and
certification schemes that exist around the world
Outcome: the need to develop a 9-month working group proposal
centered on supporting the members of RDA in developing their
own professional career paths around their own skills, interests
and contexts.
What we can do with you
1. Improve and Validate EDSF
1. Identifying the “soft skills”: how to ask a research/business question?
2. Identifying the Community need: from stewards to scientists, any market, any discipline
3. Validate completeness of BoK, coverage of CF, usability of MC
4. Promote national workshops for bottom-up adoption of EDSF
2. Career Development
1. Specifications for DSP job positions in Data Management and Librarian teams, and
engagement mechanisms between employers and DSP candidates
2. Links and recommendations for placing students to gain DSP work experience
3. Facilitate cross-institutional agreements on DSP career paths
4. Supporting Training through DataSciencePro.eu
5. Mapping and comparing career paths and learning opportunities for a Personal Competence Portfolio (PCP)
6. Advise on events, courses and tools for Community training
7. Develop Virtual Labs, re-usable and promoted further out of your Community
8. Certification: from badges to professions – the how-to of a Community-driven Data Science Certification (RDA)
• Invitation to contribute and cooperate:
– Forum, EDISON Liaison Groups, Champions Conference (Spring & Summer 2017)
• EDISON project website http://edison-project.eu/
• EDISON Data Science Framework Release 1 (EDSF)
http://edison-project.eu/edison-data-science-framework-edsf
• Community oriented - Survey Data Science Competences (Available Soon)