Big Data Challenges: Data Management, Analytics & Security
Ivo D. Dinov
Statistics Online Computational Resource, University of Michigan
www.SOCR.umich.edu
Big Data Challenges
• Availability, Sharing, Aggregation and Services
• Classical Data Science vs. Innovative Big Data Science
– Amateur Scientists vs. “Experts”
– Data Scientists vs. Practitioners
– Domain-specific vs. Trans-disciplinary knowledge
• Commercial vs. Open-source Resourceome
• Rapid Big Data Evolution
• Big Data IT proliferation
• Big Data Security risks
• Centralization won’t work in Big Data Space
• Big Data is incredibly time, space, protocol, context dependent!
Big Data Characteristics
* Mixture of quantitative & qualitative estimates Dinov et al. (2013)
Availability, Sharing, Aggregation & Services
• Cisco: “By the end of 2012, the number of mobile-connected devices [exceeded] the number of people on Earth”
• There will be over 10 billion mobile-connected devices in 2016; i.e., there will be 1.3 mobile devices per capita
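The per-capita figure follows from dividing the device count by the world population. A minimal sketch of that arithmetic, assuming a 2016 world population of roughly 7.4 billion (an assumption, not a number from the slide) and taking 10 billion as the lower bound of the device count:

```python
# The device count comes from the slide; the population is an assumed,
# approximate value used only to illustrate the per-capita arithmetic.
DEVICES_2016 = 10e9            # "over 10 billion" mobile-connected devices
WORLD_POPULATION_2016 = 7.4e9  # assumption: approximate world population


def devices_per_capita(devices: float, population: float) -> float:
    """Return the average number of devices per person."""
    if population <= 0:
        raise ValueError("population must be positive")
    return devices / population


ratio = devices_per_capita(DEVICES_2016, WORLD_POPULATION_2016)
print(f"{ratio:.2f} devices per person")
```

Under these assumed inputs the ratio lands close to the slide's 1.3 devices per capita; a different population estimate shifts the result slightly.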
[Figure: bubble chart of industry sectors; axes: Big Data Value Potential Index and Percent Growth; bubble size ~ relative size of GDP. Sources: U.S. Bureau of Labor Statistics; McKinsey Global Institute.]
Industry sectors: Computer & Electronic Products; Information Services; Manufacturing; Admin, support & waste management; Transportation & Warehousing; Wholesale Trade; Professional Services; Healthcare Providers; Real Estate and Rental; Finance and Insurance; Utilities; Retail Trade; Government; Accommodation & Food; Arts & Entertainment; Corporate Management; Other Services; Construction; Education Services; Natural Resources
Amateur Scientists vs. “Experts”
• Democratization of Big Data Science
• Doctoral studies or certification are not mandatory, nor do they guarantee appropriate Big Data expertise
• Lower barriers to entry
• Demand for constant “Continuing Education” and self-training
• Dichotomy between theoretical and empirical sciences
• Differences between fundamental knowledge and experimental skills (big data properties closely approximate core scientific principles)
Big Data Science
Medical Sciences Social Sciences Environmental Sciences ....
Math/Stats Physics Biology Chemistry ....
Engineering Computer Science Bioinformatics Biomath/Biostats ....
Domain-specific vs. Trans-disciplinary knowledge
Commercial vs. Open-source Resourceome
• There is an explosion of open-data-science resources
– www.data.gov
– www.ncbi.nlm.nih.gov/gap
• Spawning of a number of industries and enterprises blending proprietary and open-source data, code, documentation, expert-support, infrastructure and services
• Big Data to Knowledge: www.BD2K.org
• Google Cloud Platform (GCP)
• Amazon Web Services (AWS)
Rapid Big Data Evolution
• Millions of Grass-Roots initiatives addressing Big Data Challenges
• Big Data complexities require truly innovative, collaborative, trans-disciplinary solutions
• Increase of Data complexity
– Sources
– Heterogeneity
– Datum-elements
– Incongruent sampling
Data Scientists vs. Practitioners
• Modelers, Engineers, (Applied) Users
• No one user completely understands the entire pipeline of data provenance, processing protocols, analytic strategies, or results interpretation
• Black-boxes …
– Accuracy
– Privacy concerns
– Consistency
– Infrastructure
Big Data Security Risks
• Big Data Fusion provides enormous opportunities … and presents significant challenges
• Privacy, security and legal concerns, authenticity, accuracy, consistency, reliability, availability
• Healthcare
– Cloud services enable sharing of big data
– Significant security and privacy concerns exist
– Health Insurance Portability and Accountability Act (HIPAA)
– EMR/EHR Federal, state and local regulations/policies (IRBMED)
• Genetics
• Viral - Dual-use research of concern (DURC), 10.1126/science.1223995
– de novo synthesis of polio virus, the Australian mousepox experiment, the Penn State aerosolization study
Kryder’s law: Exponential Growth of Data
Dinov et al., 2013
[Figure, panel “Increase of Imaging Resolution”: series Cryo_Byte, Cryo_Short, Cryo_Color; resolution axis from 1 µm to 1 cm; vertical axis from 0 to 6E+15.]
[Figure, panel “Data volume increases faster than computational power”: series Neuroimaging (GB), Genomics_BP (GB), Moore’s Law (1000's Trans/CPU); time axis in five-year periods from 1985-1989 to 2015-2019 (estimated); vertical axis from 0 to 15,000,000.]
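The claim that data volume outpaces computational power can be illustrated by comparing two exponential curves with different doubling times. The sketch below is only illustrative: the doubling times are hypothetical placeholders chosen for the example, not values read from the figure or taken from Dinov et al. (2013).

```python
def growth_factor(years: float, doubling_time_years: float) -> float:
    """Multiplicative growth after `years`, given a fixed doubling time."""
    if doubling_time_years <= 0:
        raise ValueError("doubling time must be positive")
    return 2.0 ** (years / doubling_time_years)


# Hypothetical doubling times (assumptions for illustration only).
DATA_DOUBLING_YEARS = 1.0     # assumed: data volume doubles every year
COMPUTE_DOUBLING_YEARS = 2.0  # assumed: transistor count doubles every two years

for years in (5, 10, 20):
    data = growth_factor(years, DATA_DOUBLING_YEARS)
    compute = growth_factor(years, COMPUTE_DOUBLING_YEARS)
    # The gap is itself exponential: it doubles every
    # 1 / (1/DATA_DOUBLING_YEARS - 1/COMPUTE_DOUBLING_YEARS) years.
    print(f"{years:>2} yr: data x{data:,.0f}, compute x{compute:,.0f}, "
          f"gap x{data / compute:,.0f}")
```

Whatever the actual rates are, any persistent difference in doubling time makes the ratio of data volume to compute capacity grow exponentially as well, which is the point of the slide.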
Alzheimer’s Case Study: Stable-MCI vs. MCI-Converters
Goals: Assess the predictive power of combinations of biomarkers and imaging-derived measures as reliable predictors of conversion from MCI to Alzheimer’s disease
Data: MCI converters to AD (24-month period) and stable non-converters, matched for age, gender, handedness and education level; imaging (sMRI), behavioral, clinical, neuropsychiatric and biological data
Approach: Qualitative Exploratory Data Analysis and Quantitative Statistical Analysis (morphometric imaging correlates with clinical and genetics markers)
MCI = Mild Cognitive Impairment (prelude to dementia of Alzheimer’s type)
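Column groups: Subject (Index); Demographics (Age, Kg, Sex); Genetics (APOE A1, APOE A2); Clinical (NPI SCORE, MMSE, GDTOTAL, CDR, FAQ TOTAL); Neuroimaging (L Gyrus Rectus BL, L Superior Occipital Gyrus BL, R Fusiform Gyrus BL, L Caudate BL, R Caudate BL, L Putamen BL, R Putamen BL, …)
Index | Age | Kg | Sex | APOE A1 | APOE A2 | NPI SCORE | MMSE | GDTOTAL | CDR | FAQ TOTAL | L Gyrus Rectus BL | L Superior Occipital Gyrus BL | R Fusiform Gyrus BL | L Caudate BL | R Caudate BL | L Putamen BL | R Putamen BL | …
1 | 65 | 59 | F | 3 | 4 | 0 | 23 | 1 | 0.5 | 7 | 1695 | 3976 | 8363 | 1296 | 1992 | 1749 | 2776 | …
2 | 73 | 93 | M | 3 | 3 | 7 | 19 | 1 | 1 | 8 | 1333 | 6016 | 13290 | 835 | 2137 | 2290 | 4327 | …
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | … | … | … | … | … | … | … | …
N | 64 | 63 | F | 3 | 3 | 3 | 29 | 6 | 0.5 | 2 | 2237 | 6887 | 16109 | 1223 | 2222 | 2525 | 4110 | …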
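One way to hold such a multi-source record in code is to keep the demographic and genetic fields flat and group the clinical and neuroimaging measures by domain. The sketch below parses the two fully shown rows of the subject table; the class name, field names and the reading of "Kg" as body weight are assumptions made for illustration, and the columns elided by "…" in the table are not modeled.

```python
from dataclasses import dataclass
from typing import Dict, List

# Column names follow the table; underscores replace spaces (naming is an assumption).
CLINICAL = ("NPI_SCORE", "MMSE", "GDTOTAL", "CDR", "FAQ_TOTAL")
IMAGING = (
    "L_Gyrus_Rectus_BL", "L_Superior_Occipital_Gyrus_BL", "R_Fusiform_Gyrus_BL",
    "L_Caudate_BL", "R_Caudate_BL", "L_Putamen_BL", "R_Putamen_BL",
)


@dataclass
class SubjectRecord:
    index: str
    age: int
    weight_kg: int          # assumption: the "Kg" column is body weight
    sex: str
    apoe_a1: int
    apoe_a2: int
    clinical: Dict[str, float]
    neuroimaging: Dict[str, int]


def parse_row(row: str) -> SubjectRecord:
    """Parse one whitespace-separated table row into a SubjectRecord."""
    fields = row.split()
    expected = 6 + len(CLINICAL) + len(IMAGING)
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, got {len(fields)}")
    clinical_values = fields[6:6 + len(CLINICAL)]
    imaging_values = fields[6 + len(CLINICAL):]
    return SubjectRecord(
        index=fields[0],
        age=int(fields[1]),
        weight_kg=int(fields[2]),
        sex=fields[3],
        apoe_a1=int(fields[4]),
        apoe_a2=int(fields[5]),
        clinical={k: float(v) for k, v in zip(CLINICAL, clinical_values)},
        neuroimaging={k: int(v) for k, v in zip(IMAGING, imaging_values)},
    )


# The two complete rows shown in the table, without the trailing "…".
rows: List[str] = [
    "1 65 59 F 3 4 0 23 1 0.5 7 1695 3976 8363 1296 1992 1749 2776",
    "2 73 93 M 3 3 7 19 1 1 8 1333 6016 13290 835 2137 2290 4327",
]
records = [parse_row(r) for r in rows]
for rec in records:
    print(rec.index, rec.sex, rec.clinical["MMSE"], rec.neuroimaging["L_Caudate_BL"])
```

Grouping by domain keeps the heterogeneous sources (demographic, genetic, clinical, imaging) distinguishable after they are merged into one record.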
Alzheimer’s Case Study: Stable-MCI vs. MCI-Converters
Classification Results Using Baseline Data
Confusion matrix (columns: True State, Dx at 24 month follow up; rows: Hierarchical Clustering Prediction Ana (7 Regions))
Predicted | Converter | Stable | Total
Converter | TP | FP | TP+FP
Stable | FN | TN | FN+TN
Total | TP+FN | FP+TN | N
Metric | Top 7 Regions | Top 20 Regions
Sensitivity | 0.81 | 1.0
Specificity | 0.61 | 0.87
Power to detect Converters | 0.91 | 1.0
Accuracy | 0.70 | 0.93
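Sensitivity, specificity and accuracy follow from the confusion-matrix cells by their standard definitions. The sketch below uses hypothetical counts, because the slide reports only the resulting rates and not the underlying TP/FP/FN/TN values; "Power to detect Converters" is left out because the slide does not define how it was computed.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Standard rates derived from a 2x2 confusion matrix."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one true converter and one true stable case")
    n = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn),  # true converters correctly predicted
        "specificity": tn / (tn + fp),  # true stable cases correctly predicted
        "accuracy": (tp + tn) / n,      # all correct predictions over N
    }


# Hypothetical counts for illustration only; they are not the study's data.
metrics = classification_metrics(tp=8, fp=3, fn=2, tn=7)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```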
Alzheimer’s Case Study: Stable-MCI vs. MCI-Converters