0*/-121)&1-23 - MIDAS · 2016-04-27 · Betweenness, Complete Bipartite Subgraphs, Graph Partition,...

22
!"#$%&'() + ,-%')'). /-(.-%0 /-121)&1-2 Ivo D. Dinov Assoc. Director MIDAS Ed & Training Nandit Soparkar CEO, Ubiquiti Patrick Harrington Co-Founder Chief Data Scientist CompGenome.com Erin Shellman Research Scientist AWS

Transcript of 0*/-121)&1-23 - MIDAS · 2016-04-27 · Betweenness, Complete Bipartite Subgraphs, Graph Partition,...

Page 1: 0*/-121)&1-23 - MIDAS · 2016-04-27 · Betweenness, Complete Bipartite Subgraphs, Graph Partition, Matrices and Graphs, Eigen-values/ vectors of the Laplacian Matrix, Graph Neighborhood

!"#$%&'()*+*,-%')').*/-(.-%0*/-121)&1-23

Ivo D. Dinov Assoc. Director MIDAS Ed & Training

Nandit Soparkar CEO, Ubiquiti

Patrick Harrington Co-Founder Chief Data Scientist CompGenome.com

Erin Shellman Research Scientist AWS

Page 2: 0*/-121)&1-23 - MIDAS · 2016-04-27 · Betweenness, Complete Bipartite Subgraphs, Graph Partition, Matrices and Graphs, Eigen-values/ vectors of the Laplacian Matrix, Graph Neighborhood

MIDAS Data Science Education & Training:

Challenges & Opportunities

Ivo D. Dinov

Associate Professor, School of Nursing

Associate Education Director, Michigan Institute for Data Science (MIDAS)

University of Michigan http://MIDAS.umich.edu/education

Page 3: 0*/-121)&1-23 - MIDAS · 2016-04-27 · Betweenness, Complete Bipartite Subgraphs, Graph Partition, Matrices and Graphs, Eigen-values/ vectors of the Laplacian Matrix, Graph Neighborhood

4%&'()%5*6'.*7%&%*8$'1)$1**9#--'$#5%*9()2&155%&'()3

Recently established Data Science institutes and curricular programs!

MIDAS

Page 4: 0*/-121)&1-23 - MIDAS · 2016-04-27 · Betweenness, Complete Bipartite Subgraphs, Graph Partition, Matrices and Graphs, Eigen-values/ vectors of the Laplacian Matrix, Graph Neighborhood

:;7<8*6'.*7%&%*!$(2=2&103

http://socr.umich.edu/docs/BD2K/BigDataResourceome.html!

:;7<8*6'.*7%&%*!$(2=2&103

http://socr.umich.edu/docs/BD2K/BigDataResourceome.htmlhttp://socr.umich.edu/docs/BD2K/BigDataResourceome.htmlhttp://socr.umich.edu/docs/BD2K/BigDataResourceome.html

MIDAS

Page 5: 0*/-121)&1-23 - MIDAS · 2016-04-27 · Betweenness, Complete Bipartite Subgraphs, Graph Partition, Matrices and Graphs, Eigen-values/ vectors of the Laplacian Matrix, Graph Neighborhood

:;7<8*>($#2*()*71?15(@').*6'.*7%&%*8A'5523

o! Listening: streams, analyzing sentiment, intent and trends;

o! Looking: searching, indexing and memory management of heterogeneous datasets; Raw, derived or indexed data as well as meta-data;

o! Programming: Handling Map-Reduce/HDFS, No-SQL DB, protocol provenance, algorithm development/optimization, pipeline workflows;

o! Inferring: Principles of data analyses, Bayesian modeling, inference, uncertainty and quantification of likelihoods; Reasoning & logic; Analytics: Regression, feature selection, dimensionality reduction, temporal patterns, validation; Actionable Knowledge

o! Machine Learning: Classification, clustering, mining, information extraction, knowledge retrieval, decision making;

o! Predicting: Forecasting, neural models, deep learning, and research topics;

o! Summarizing: Presentation of data, processing protocol, analytics provenance, visualization, synthesis

Page 6: 0*/-121)&1-23 - MIDAS · 2016-04-27 · Betweenness, Complete Bipartite Subgraphs, Graph Partition, Matrices and Graphs, Eigen-values/ vectors of the Laplacian Matrix, Graph Neighborhood

8@1$'B$*7%&%*8$'1)$1*,-%')').*9C%551).123

o! Technological advances and students’ IT skills far outpace educators’ technological expertise and the current rate of IT adoption in college curricula

o! Lack of open learning resources and interactive interoperable platforms

o! Difficulty sharing data, tools, materials & activities across different disciplines

o! Limited technology-enhanced continuing instructor education opportunities

o! Discipline-specific knowledge boundaries o! Skills for communication and teamwork

involving cooperation in dynamic groups of dispersed researchers

o! Ability to aggregate & harmonize data (e.g., fusion of qualitative and quantitative elements).

knowledge boundaries

(e.g., fusion of qualitative and quantitative

and the current

data, tools, materials

continuing

knowledge boundaries knowledge boundaries

continuing

Page 7: 0*/-121)&1-23 - MIDAS · 2016-04-27 · Betweenness, Complete Bipartite Subgraphs, Graph Partition, Matrices and Graphs, Eigen-values/ vectors of the Laplacian Matrix, Graph Neighborhood

9(-1*:;7<8*!"#$%&'()*9(0@()1)&23

o! Drivers: Motivations, datasets (6D of Big Data), challenges, applications!

o! Methods: foundational and trans-disciplinary scientific techniques!

o! Tools: web-services, software, code, platforms!

o! Analytics: practice of data interrogation!

Page 8: 0*/-121)&1-23 - MIDAS · 2016-04-27 · Betweenness, Complete Bipartite Subgraphs, Graph Partition, Matrices and Graphs, Eigen-values/ vectors of the Laplacian Matrix, Graph Neighborhood

!D'2&1)&*7%&%*8$'1)$1*!"#$%&'()*+*,-%')').3

o! Undergraduate DS Degree Program!o! Stats + Engineering!o! Started Fall 2015!o! www.eecs.umich.edu/eecs/undergraduate/data-

science!

Undergraduate DS Degree Program

o! DS Summer Institute (Public Health)!o! 20+ faculty, 40+ trainees!o! Full support/in residence (1 month)!o! BigDataSummerInst.sph.umich.edu!

o! Big Data Summer Bootcamp!o! 5 years running, Business-

Engineering!o! Practice of Econ-Bio-Social

Analytics!o! ibug-um15.github.io/2015-summer-

camp!o! Graduate DS Certificate Program!o! Transdisciplinary (SM,SPH,LS&A,SN,CoE,SI)!o! 12 cr, Modeling+Technology

+Practice!o! http://MIDAS.umich.edu/certificate!

Full support/in residence (1 month)BigDataSummerInst.sph.umich.edu

SM,SPH,LS&A,SN,CoE,SI)

Page 9: 0*/-121)&1-23 - MIDAS · 2016-04-27 · Betweenness, Complete Bipartite Subgraphs, Graph Partition, Matrices and Graphs, Eigen-values/ vectors of the Laplacian Matrix, Graph Neighborhood

:;7<8*!"#$%&'()*+*,-%')').*EF(').*>(-G%-"H3

o! Graduate Data Science Certificate Program!o! Enrollment of 100 (UMich) students!

o! Develop the Online Graduate Data Science Certificate Curriculum!

o! Develop a 5-yr dual undergrad-grad MS Program in Data Science!o! DS Summer Institute (Public Health)!o! Big Data Summer Bootcamp!

o! MIDAS Seminar Series (academia, industry, government, partners)!o! Web-streamed and archived for broader impact!

o! National Events!o! BD2K, Midwest Big-Data-Hub, Math, Stats, Computer Vision,

Machine Learning, Informatics, …!

o! Funding: NIH, NSF, Foundation: Research & Traineeship Applications!

Page 10: 0*/-121)&1-23 - MIDAS · 2016-04-27 · Betweenness, Complete Bipartite Subgraphs, Graph Partition, Matrices and Graphs, Eigen-values/ vectors of the Laplacian Matrix, Graph Neighborhood

:'$C'.%)*7'I1-1)$1*')*37%&%*8$'1)$1*!"#$%&'()*+*,-%')').3

o! Public-Private-Partnership (PPP) Model!

o! Integration of Research + Practice + Education!

o! Problems + Modeling + Technology + Applications!

UMich

Industry Fed Orgs

Community

Problems + Modeling + Technology + ApplicationsProblems + Modeling + Technology + Applications AtheyLab!

Page 11: 0*/-121)&1-23 - MIDAS · 2016-04-27 · Betweenness, Complete Bipartite Subgraphs, Graph Partition, Matrices and Graphs, Eigen-values/ vectors of the Laplacian Matrix, Graph Neighborhood

8&%&'2&'$2*J)5')1*9(0@#&%&'()%5*K12(#-$1*E8J9KH*;)&1.-%&'()*(L*!"#$%&'()M*K121%-$C*+*/-%$&'$13

o! Probability and Statistics EBook!o! Motivation, Methods, Techniques,

Practice!o! wiki.socr.umich.edu/index.php/EBook!

o! Data sets!o! 100’s of Research, Observed & Simulated!o! wiki.socr.umich.edu/index.php/

SOCR_Data!o! Hands-on Activities (Team-focused)!

o! Modeling, Inference, Concept Demos!o! wiki.socr.umich.edu/index.php/SOCR_EduMaterials!

o! Applets and Webapps!o! Over 500 Web tools!o! Distributed services based on Java, HTML5/

JavaScript!

SOCR has over 9.6M users worldwide since 2002!www.SOCR.umich.edu!

Features: Free, Cloud-service, no-barrier access, LGPL/CC-BY!Ref: Dinov et al., TS (2013), DOI: 10.1111/test.12012!

100’s of Research, Observed & Simulated

(Team-focused)Modeling, Inference, Concept Demos

SOCR_EduMaterials

Page 12: 0*/-121)&1-23 - MIDAS · 2016-04-27 · Betweenness, Complete Bipartite Subgraphs, Graph Partition, Matrices and Graphs, Eigen-values/ vectors of the Laplacian Matrix, Graph Neighborhood

:;7<8N8J9K*7%2CO(%-"P*6'.*7%&%*>#2'()*!D%0@513

http://socr.umich.edu/HTML5/Dashboard!

Page 13: 0*/-121)&1-23 - MIDAS · 2016-04-27 · Betweenness, Complete Bipartite Subgraphs, Graph Partition, Matrices and Graphs, Eigen-values/ vectors of the Laplacian Matrix, Graph Neighborhood

6'.*7%&%*<)%5=&'$23http://socr.umich.edu/HTML5/Dashboard!

o! Web-service combining and integrating multi-source socioeconomic and medical datasets !

o! Big data analytic processing!

o! Interface for exploratory navigation, manipulation and visualization!

o! Adding/removing of visual queries and interactive exploration of multivariate associations!

o! Powerful HTML5 technology enabling mobile on-demand computing!

Husain, et al., 2015, J Big Data!

Page 14: 0*/-121)&1-23 - MIDAS · 2016-04-27 · Betweenness, Complete Bipartite Subgraphs, Graph Partition, Matrices and Graphs, Eigen-values/ vectors of the Laplacian Matrix, Graph Neighborhood

:;7<8*J)5')1*7%&%*8$'1)$1*!"#$%&'()*/-(.-%0*E78!/H3

MIDAS plans to:

o! Develop new hands-on MOOC DS courses (data, learning modules, services, code snippets, applications/partners) …

o! Deploy MIDAS Training as a Service (TaaS) MOOC platform

o! Compile Datasets (research-derived, observational, simulated) – identify, aggregate, manage, navigate, and service exemplary Big Data sets

o! Faculty/Mentors–instructors, collaborators, partners, students and staff to develop a core DS course-series (4 courses)

o! Instructional resources and complete end-to-end learning modules

o! Track and Validate the DSEP Program – annual reports (MIDAS ETC), review of evaluations, sustainability (financial, resources, space), impact

Page 15: 0*/-121)&1-23 - MIDAS · 2016-04-27 · Betweenness, Complete Bipartite Subgraphs, Graph Partition, Matrices and Graphs, Eigen-values/ vectors of the Laplacian Matrix, Graph Neighborhood

How to Engage & Partner in MIDAS Education?

Partners!

o! Open-Source Community!

o! Academia!o! Trainees!o! Industry!o! Non-profit!o! Government!o! Philanthropy!o! Community

Orgs!

Activities!!

Drivers!o! Challenges!o! Datasets!o! Applications!

Methods!!Technologies!!Tools/Services!!Training Opps!o! Mentorship!o! DS Projects!o! Internships!

Page 16: 0*/-121)&1-23 - MIDAS · 2016-04-27 · Betweenness, Complete Bipartite Subgraphs, Graph Partition, Matrices and Graphs, Eigen-values/ vectors of the Laplacian Matrix, Graph Neighborhood

:;7<8*!"#$%&'()*+*,-%')').*,1%03

MIDAS Education & Training Committee Ivo Dinov, Margaret Hedstrom, Honglak Lee, Sebastian Zöllner, Richard Gonzalez, Kerby Shedden

Other Contributing MIDAS Faculty Engineering

Alfred Hero: Electrical Engineering and Computer Science; Biomedical Engineering, Stats H. V. Jagadish: Electrical Engineering and Computer Science

Mike Cafarella: Computer Science and Engineering

Karthik Duraisamy: Atmospheric, Oceanic, and Space Sciences

Judy Jin: Industrial & Operations Engineering

Dragomir Radev: School of Information; Computer Science and Engineering; Linguistics

Information & Health Sciences Brian Athey: Computational Medicine and Bioinformatics, Medicine

Carl Lagoze: School of Information

Qiaozhu Mei: School of Information

Jeremy Taylor, Biostatistics, Public Health

Basic Sciences

Vijay Nair: Statistics & Engineering

George Alter: Institute for Social Research, History Christopher Miller: Physics & Astronomy

August Evrard: Physics & Astronomy

Anna Gilbert: Mathematics, Engineering

Stephen Smith: Ecology and Evolutionary Biology

Ambuj Tewari: Statistics; Computer Science and Engineering

Contact

http://MIDAS.umich.edu !

[email protected]!

[email protected]

Page 17: 0*/-121)&1-23 - MIDAS · 2016-04-27 · Betweenness, Complete Bipartite Subgraphs, Graph Partition, Matrices and Graphs, Eigen-values/ vectors of the Laplacian Matrix, Graph Neighborhood

!"#$%&'()*+*,-%')').*/-(.-%0*/-121)&1-23

Ivo D. Dinov Assoc. Director MIDAS Ed & Training

Nandit Soparkar CEO, Ubiquiti

Patrick Harrington Co-Founder/Chief Data Scientist CompGenome.com

Erin Shellman Research Scientist, AWS

Page 18: 0*/-121)&1-23 - MIDAS · 2016-04-27 · Betweenness, Complete Bipartite Subgraphs, Graph Partition, Matrices and Graphs, Eigen-values/ vectors of the Laplacian Matrix, Graph Neighborhood

!"#$%&'()*+*,-%')').*/-(.-%0*/-121)&1-23

Ivo D. Dinov Assoc. Director MIDAS Ed & Training

Nandit Soparkar CEO, Ubiquiti

Patrick Harrington Co-Founder Chief Data Scientist CompGenome.com

Erin Shellman Research Scientist AWS

Page 19: 0*/-121)&1-23 - MIDAS · 2016-04-27 · Betweenness, Complete Bipartite Subgraphs, Graph Partition, Matrices and Graphs, Eigen-values/ vectors of the Laplacian Matrix, Graph Neighborhood

6'.*7%&%*8$'1)$13 From a descriptive to a constructive definition of Big Data

–! IBM’s 4V’s – volume, variety, velocity (speed) and veracity (reliability)

–! MIDAS Big Data Characterization – identifies gaps, challenges & needs

Example: analyzing observational data of 1,000’s Parkinson’s disease patients based on 10,000’s signature biomarkers derived from multi-source imaging, genetics, clinical, physiologic, phenomics and demographic!data elements. !!Software developments, student training, service platforms and methodological advances associated with the Big Data Discovery Science all present existing opportunities for learners, educators, researchers, practitioners and policy makers!

Mixture of quantitative!& qualitative estimates!Dinov, et al. (2014)!

Incompleteness

Size Expected 2016 Increase 2018 Expected 2016 Increase 2018Expected 2016 Increase 2018

Page 20: 0*/-121)&1-23 - MIDAS · 2016-04-27 · Betweenness, Complete Bipartite Subgraphs, Graph Partition, Matrices and Graphs, Eigen-values/ vectors of the Laplacian Matrix, Graph Neighborhood

Q-="1-R2*5%GP*!D@()1)&'%5*F-(G&C*(L*7%&%*3

Dinov, et al., 2014!

Gryo_Byte

Cryo_Short

Cryo_Color 0

2E+15

4E+15

6E+15

1 µm 10 µm

100 µm 1mm

1cm

Gryo_Byte

Cryo_Short

Cryo_Color

1mm

Neuroimaging(GB) Genomics_BP(GB) Moore’s Law (1000'sTrans/CPU) 0

5000000

10000000

15000000

Neuroimaging(GB)

Genomics_BP(GB)

Moore’s Law (1000'sTrans/CPU)

Increase of Image Resolution!

Data volume!Increases faster !than computational power!

" Time "!

Wal

ter,

Sci

Am

, 200

5!st

orag

e do

uble

s ev

ery

1.2

yrs!

Com

puta

tiona

l pow

er !

doub

le e

ach

1.6

year

s!

Page 21: 0*/-121)&1-23 - MIDAS · 2016-04-27 · Betweenness, Complete Bipartite Subgraphs, Graph Partition, Matrices and Graphs, Eigen-values/ vectors of the Laplacian Matrix, Graph Neighborhood

7%&%*8$'1)$1*,-%')').*/-(.-%0P*9(-1*9#--'$#5#03

Themes! Training! Examples!

Dat

a

Scru

bbin

g! Big Data Management!

Big Data technology, Searching, indexing, memory management, Information extraction, feature selection, Supervised and unsupervised-learning, stream mining

Big Data Representation!

Matrix Representation of Sets, Minhashing, Jaccard Similarity, Distance Measures, Euclidean Distances, Jaccard Distance, Cosine Distance, Edit Distance, Hamming Distance, Networks, Graph similarity, Sets as Strings, Prefix Indexing, Data Streams and Processing, Representative Samples, Bloom Filter, Flajolet-Martin Algorithm, TF/IDF, MoM, MLE, Alon-Matias-Szegedy Algorithm for Second Moments, Datar-Gionis-Indyk-Motwani Algorithm (DGIM)!

Dat

a

Min

ing! Mining Social-

Network Graphs!

Bio-Social Network Graphs, Root-Mean-Square Error, UV-Decomposition, The NetFlix Challenge, Tweeter Challenge, Distances and Clustering of Social-Network Graphs, Network Cluster Betweenness, Complete Bipartite Subgraphs, Graph Partition, Matrices and Graphs, Eigen-values/vectors of the Laplacian Matrix, Graph Neighborhood Properties, Diameter of a Graph, Transitive Closure and Reachability, Crowdsourcing Algorithms

Modeling! Statistical Modeling, Machine Learning, Computational Modeling, Feature Extraction, Power Laws, Map-Reduce/Hadoop!

Link Analysis! PageRank, Spider Traps, Taxation and loops in Big Data traversal, Random Walks!

Dat

a

Infe

renc

e! Clustering!Curse of Dimensionality, Classification and Regression Trees/Random Forests, Hierarchical Clustering, K-means Algorithms, Bradley, Fayyad, and Reina (BFR) Algorithm, CURE Algorithm, Clusters in the GRGPF Algorithm!

Classification! Random Forest, SVM, Neural Networks, Latent Class Models, Finite Mixture Models!

Dimensionality Reduction!

Eigenvalues and Eigenvectors, Principal-Component Analysis, Singular-Value Decomposition, Wrapper-Based vs Filter-Based Feature Selection, Independent Components Analysis, Multidimensional Scaling, CUR Decomposition!

Page 22: 0*/-121)&1-23 - MIDAS · 2016-04-27 · Betweenness, Complete Bipartite Subgraphs, Graph Partition, Matrices and Graphs, Eigen-values/ vectors of the Laplacian Matrix, Graph Neighborhood

BD

Big Data Information Knowledge Action Raw Observations Processed Data Maps, Models Actionable Decisions

Data Aggregation Data Fusion Causal Inference Treatment Regimens

Data Scrubbing Summary Stats Networks, Analytics Forecasts, Predictions

Semantic-Mapping Derived Biomarkers Linkages, Associations Healthcare Outcomes

BDBDBDBDBDBDBD

Information Knowledge Processed Data Maps, Models

I K

Knowledge Action Maps, Models Actionable Decisions

Knowledge Maps, Models

A