The Anatomy and Physiology of Data Science

Peter Fox (1) (pfox@cs.rpi.edu) http://tw.rpi.edu/web/Courses

((1) Rensselaer Polytechnic Institute, 110 8th St., Troy, NY, 12180 United States – see Acknowledgments)

Glossary:
RPI – Rensselaer Polytechnic Institute
TWC – Tetherless World Constellation at Rensselaer Polytechnic Institute

Acknowledgments: TWC eScience Group, W3C Provenance Working Group

Sponsors: Rensselaer Polytechnic Institute, Tetherless World Constellation

MOTIVATION

Whether the science (especially geosciences) community at large likes it or not, the co-opting of the term Data Science by the private sector has led to increased hype over data science as a career and as a means to solve challenging data problems, and to a lack of educational innovation in curricula for data science.

If the full benefits of a new generation of statistical and analytical software tools that operate on high-performance computational infrastructure are to be attained, adequate attention to the 'science of data science' is needed. In this contribution, we present a science view of data science both from an education and research perspective.

We introduce a research agenda that explores the key challenges that must be addressed to meet the needs of research driven by large-scale data analytics.

We focus on three as-yet untapped data science topics: understanding scale in systems, sparse systems, and abductive reasoning.

We conclude with a specific call to action to make progress on the aforementioned topics.

The Landscape – Data Ecosystem and What Makes Up a Data Scientist?

Learning Outcomes

Physiology (in a group):
Definition of Science Hypotheses, Guiding Questions
Finding and Integrating Datasets
Presenting Analyses and Viz.
Presenting Conclusions

Institutions to provide reliable, high-functionality data infrastructures that facilitate analytics
Provision of intermediate to advanced Statistics to undergraduates and early graduate students
Well-curated datasets are made widely available along with developed models and validation statistics
All results are under continuous scrutiny, are traceable and verifiable

AGUFM14 – ED31E-3455 (MS Hall A-C)

To demonstrate knowledge of relevant analytic methods, and to recognize and apply quantitative algorithms and techniques and interpret their results.

To demonstrate strategic thinking skills, combined with a solid technical foundation in data and model-driven decision-making.

To develop the ability to apply critical and analytical methods to formulate and solve science, engineering, medical, and business problems.

To examine real-world examples that place data-mining techniques in context, to develop data-analytic thinking, and to illustrate that their application is both art and science.

Must effectively communicate analytic findings to non-specialists.

Must develop and demonstrate a working knowledge of decision making under uncertainty, and be able to build optimization models that incorporate random parameters: static stochastic optimization, two-stage optimization with recourse, chance-constrained optimization, and sequential decision making.
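As one illustration of the "two-stage optimization with recourse" named in the outcome above, here is a minimal sketch in Python. It is not taken from the poster or the course materials: the inventory setting, the scenario probabilities, the cost constants, and the function names are all hypothetical, and the first-stage decision is found by plain enumeration rather than by a solver.

```python
# Two-stage stochastic optimization with recourse, sketched on a toy
# inventory problem. All numbers below are hypothetical illustration data.

# Each scenario is (probability, demand); probabilities sum to 1.
SCENARIOS = [(0.3, 10), (0.5, 20), (0.2, 30)]

UNIT_COST = 1.0      # first-stage cost per unit ordered in advance
RECOURSE_COST = 3.0  # second-stage cost per unit of shortfall bought late
HOLDING_COST = 0.5   # second-stage cost per unit left over


def recourse_cost(order, demand):
    """Second-stage cost once demand is revealed for a fixed first-stage order."""
    shortfall = max(demand - order, 0)
    surplus = max(order - demand, 0)
    return RECOURSE_COST * shortfall + HOLDING_COST * surplus


def expected_total_cost(order):
    """First-stage cost plus the expectation of the recourse cost."""
    return UNIT_COST * order + sum(p * recourse_cost(order, d) for p, d in SCENARIOS)


def solve(candidates):
    """Pick the first-stage order with the lowest expected total cost."""
    best_order = min(candidates, key=expected_total_cost)
    return best_order, expected_total_cost(best_order)


best_order, best_cost = solve(range(0, 41))
print(best_order, best_cost)
```

The decision made before the uncertainty resolves (the order) is evaluated by averaging the cost of the best corrective action in each scenario. Realistic models would typically replace the enumeration with a linear or integer programming formulation.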

Anatomy (as an individual):
Data Life Cycle – Acquisition, Curation and Preservation
Data Management and Products
Forms of Analysis, Errors and Uncertainty
Technical tools and standards

Anatomy is the study of the structure of, and relationships between, body parts.

Physiology is the study of the function of body parts and the body as a whole.

[Diagram labels: Data, Information, Knowledge; Producers, Consumers; Context; Presentation, Organization; Integration, Conversation; Creation, Gathering; Experience]

BigData Science (Data Analytics) | Anatomy & Physiology | Call To Action

Learning Outcomes | “Data” Science | Anatomy & Physiology | Call To Action

Anatomy (individual):
Intermediate Skill in parametric and non-parametric statistics
Application of a broad spectrum of Data Mining and Machine Learning Algorithms
Ability to cross-validate and optimize models
Application to specific datasets
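The "cross-validate and optimize models" skill listed above can be sketched as follows. This is an illustrative example only, not material from the poster: the synthetic data, the one-parameter ridge model, the penalty grid, and all function names are assumptions made for the sketch, and only the Python standard library is used.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_ridge_slope(xs, ys, lam):
    """Ridge estimate of the slope for the no-intercept model y = slope * x."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def mse(xs, ys, slope):
    """Mean squared prediction error of the fitted slope."""
    return sum((y - slope * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def cross_val_mse(xs, ys, lam, k=5):
    """Average held-out error over k folds for one penalty value."""
    scores = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        slope = fit_ridge_slope([xs[i] for i in train], [ys[i] for i in train], lam)
        scores.append(mse([xs[i] for i in held_out], [ys[i] for i in held_out], slope))
    return sum(scores) / k


# Hypothetical synthetic data: true slope 2 with Gaussian noise.
rng = random.Random(42)
xs = [i / 10 for i in range(1, 51)]
ys = [2.0 * x + rng.gauss(0, 0.5) for x in xs]

# "Optimize" here means choosing the penalty with the lowest cross-validated error.
grid = [0.0, 0.1, 1.0, 10.0, 100.0]
cv_scores = {lam: cross_val_mse(xs, ys, lam) for lam in grid}
best_lam = min(cv_scores, key=cv_scores.get)
print(best_lam, cv_scores[best_lam])
```

Every model is scored only on data it was not fitted to, so the selected penalty reflects expected performance on new data rather than fit to the training set.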

Through class lectures, practical sessions, written and oral presentation assignments and projects, students should:
Develop and demonstrate skill in Data Collection and Data Management
Demonstrate proficiency in Data/Information Product Generation
Demonstrate science-driven Analysis and Presentation of Integrated Datasets from the Web
Demonstrate the development and application of Data Models
Convey knowledge of and apply Data and Metadata Standards, and explain Provenance
Apply Data Life-Cycle principles, construct Data Workflows
Develop and demonstrate skill in Data Tool Use and Evaluation

Data Science across the curriculum: Same as “Calculus” And … Intro to Statistics
Data Management is Second Nature: Like operating an instrument
Openness/sharing is the natural state
As a whole, the Data Scientist works collaboratively and is recognized and rewarded by peers and organizations

Data Science primarily advances the inductive conduct of science, but to understand scale in systems, accommodate sparse systems, and provide for abductive reasoning, data scientists must progress to data analyticists.

Data science is advancing the inductive conduct of science and is driven by the greater volumes, complexity and heterogeneity of data being made available over the Internet. Data science combines aspects of data management, library science, computer science, and physical science using supporting cyberinfrastructure and information technology. It is changing the way all of these disciplines do both their individual and collaborative work. Key methodologies in application areas based on real research experience are taught to build a skill-set.

Data and Information analytics extends analysis (descriptive and predictive models to obtain knowledge from data) by using insight from analyses to recommend action or to guide and communicate decision-making. Thus, analytics is not so much concerned with individual analyses or analysis steps, but with an entire methodology. The world at large is confronted with increasingly large and complex sets of structured and unstructured information from sensors, instruments, and computer simulations; data is "hidden" in websites, application servers, social networks and on mobile devices. In commerce and industry, analytics-driven enterprises are becoming mainstream, and traditional enterprises are moving toward analytics-driven approaches for core business functions. Yet, there is a shortfall in the key skills that education must supply to meet these growing needs. In government and corporations, cybersecurity problems are prevalent.

Key topics include: advanced statistical computing theory, multivariate analysis, the application of computer science topics such as data mining and machine learning, and change detection by uncovering unexpected patterns in data.
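As a small illustration of change detection, the sketch below uses a two-sided CUSUM (cumulative sum) detector. The poster does not name a specific method, so CUSUM is an assumed choice here, and the series, the reference-window length, and the `slack` and `threshold` parameters are hypothetical.

```python
import statistics


def cusum_first_alarm(series, n_ref, slack=0.5, threshold=5.0):
    """Return the index of the first detected mean shift, or None.

    The first n_ref points define the "expected" behaviour (mean and
    standard deviation). Later points are standardized against it, and
    deviations beyond `slack` accumulate until they exceed `threshold`.
    """
    reference = series[:n_ref]
    mu = statistics.mean(reference)
    sigma = statistics.pstdev(reference)
    if sigma == 0:
        raise ValueError("reference window has zero variance")

    s_pos = 0.0  # accumulated evidence for an upward shift
    s_neg = 0.0  # accumulated evidence for a downward shift
    for i in range(n_ref, len(series)):
        z = (series[i] - mu) / sigma
        s_pos = max(0.0, s_pos + z - slack)
        s_neg = max(0.0, s_neg - z - slack)
        if s_pos > threshold or s_neg > threshold:
            return i
    return None


# Hypothetical series: a stable regime around 10.0, then a shift to about 10.5.
stable = [9.9, 10.1] * 10
shifted = [10.4, 10.6] * 10

alarm_index = cusum_first_alarm(stable + shifted, n_ref=20)
no_alarm = cusum_first_alarm(stable + stable, n_ref=20)
print(alarm_index, no_alarm)
```

Accumulating small standardized deviations lets the detector flag a sustained shift while ignoring fluctuations that are typical of the reference window.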

Lt. Cmdr Data, Star Trek TNG

Lt. Cmdr Data and Friends

Overused Venn diagram of the intersection of skills needed for Data Science (Drew Conway)

The Data-Information-Knowledge Ecosystem (Fox; derived)

Physiology (term project):
Definition of Science Hypotheses, with Prediction/Prescription Goal
Cleaning and Preparing Datasets
Validating and Verifying Models
Presenting Ideas and Results