Data, Data, Data!
USNA Michelson Lecture 2001
Data, Data, Data!
Challenges and Opportunities of
the coming Data Deluge
David Donoho, Statistics Department, Stanford University
Thanks to ...
• Office of Naval Research:
– Reza Malek Madani
– Wen Masters
– Carey Schwartz
• Will Glaser, Savage Beast
• Jeffrey Klein, Xamplify
• John Lavery, Army Research Office
• Tracy McSheery, PhaseSpace
• Peter Weinberger, Renaissance Technologies
• Shankar Sastry, UC Berkeley
Agenda
I. The Era of Data
II. Its Characteristics
III. Its Critics
IV. Mathematics: The Long View
V. Mathematical Challenge: Navigating High Dimensional Space
I. The other World Hunger Problem
(a) Sherlock
"Data, Data, Data! I can't make bricks without clay!" -- The Adventure of the Copper Beeches
(b) Quotation
The coming Data Era
• Since beginning of history: unquenched desire for data
• Coming years, problem will be solved
• Prodigious efforts of best and brightest
• Massive constructions: ambitious and visionary
• Will define an Era
• Will permeate everything
The Era of Faith
(a) Sistine Chapel (b) Notre Dame
The coming Data Era
In place of cathedrals: massive investment for
• Ubiquitous sensors
• Pervasive networking
• Gargantuan databases
• Data-rich discourse and decision
The Data Era Touches Everything...
Touches even the most spiritual places
• Our Fate
• Our Creation
I.1 Data and Notions of Fate
(a) AB 3700 (b) Celera Genomics
Decoding the Genome
(a) Nature (b) Science
Genomic Data Availability
• 10^9 Base pairs
• 30,000 Genes
• On-line 24*7
• Ubiquitous Software
• Internet Accessibility
Genomic Data will be touching your fate
• Will you (your favorite phobia here ...)
– Will you die young?
– Will your offspring share your traits?
– Will you react poorly to stress?
• Will we ... conquer disease, defects?
I.2 Data and Notions of Creation
Sloan Digital Sky Survey
(a) Sloan Instrument (b) Las Campanas Survey
Sloan Facts (from www.sdss.org)
• Catalog larger by orders of magnitude than previous catalogs
• Data available on-line 24*7
• SDSS logo – "Cosmic Genome Project"
Data Touch our Idea of Creation
• Sensors seeing back into the earliest moments
• Processing detects the first wrinkles in space/time
Data Have Revolutionary Impact
• Today: one-time transition in human affairs
• You will be participants and shapers
Comparable Transition: Printing Press
• 50 million books in century 1450-1550
• Mass Communication, Art, Expression
• Exploration, Conquest, Empire
• Religious conflict, Total War
II. Symbols of the Modern Era?
(a) DataCenter (b) Network
True characteristics of the Modern Era
Ubiquitous, Pervasive Combination of
• Gathering Data
• Transporting
• Archiving
• Publishing
• Exploiting
II.1 Gathering Data
Everything Can and Will Become Data
• Smells
• Chemical Vision
• Motions, Gestures
• Preferences
• Essences
• Human Identity
• Behavior
Electronic Nose (N. Lewis, Caltech)
Electronic Nose Results
(a) (b)
Hyperspectral Photography
(a) Eye (b) Spectrum
Hyperspectral Photography – Derived Image
Figure 11: Derived Image
Human Motion Tracking – HiBall Tracker
(a) (b)
SAR/Lidar Sensor Fusion
New types of data: endless
• Cortical response to stimuli (fMRI)
• Internet Browsing Behavior
• Gait, Posture, Gesture
• NBA: mining videos for player moves
II.2 Ubiquitous and Pervasive Data Transport
Netcentric Communications
• Internet
• Wireless
• Bluetooth
Data Swamps Everything – Voice, Media etc.
II.3 Ubiquitous and Pervasive Archiving Data
• Earthquakes
• Galaxies
• Genome
• ...
• Massive
• Comprehensive
• Complete
Earthquake Catalogs
II.4 Ubiquitous and Pervasive Publication of Data
(a) Forbidden City (b) Detained EP3 on CNN
Publication of High-Resolution Data
(a) 1.0 Meter Res (b) 0.5 Meter Res
Publication of Financial Data
II.5 Pervasive and Ubiquitous Exploitation
• Scientific
• Military
• Industry
• Investment Community
Military need for Data
(a) (b)
Military Contributions to Data Era
• Computers
• Communications
• Networking
• Storage
• Display
Modernizing War
(a) Publ Docs (b) Army Docs
Navy Vision
Unmanned Air Vehicles
Unmanned Ground Sensors (Concept)
MicroMotes (Kris Pister, David Culler et al.)
(a) Mote (b) Network
US Corporations and Data Era
• Size of IT marketplace $750B/year
• Size of IT budgets nearing 50% of total.
Corporate Organization = Information Organization
• Merchandising
• Airlines
• Communications
• Shipping
• Manufacturing
WalMart: 7 Terabyte Transactions Database
(a) WalMart (b) Database
Investment Community and Data Era
Massive Investment in startups for ...
• Network Infrastructure
• Genomic Databases
• Consumer Behavior Tracking
• Mass Personalization
Example: Savage Beast (Oakland, CA)
(a) Query (b) Approach
Compiling Musical Genome
III Critics of the Data Era, 1
(a) Ralph Nader (b) Pope John
III Critics of the Data Era, 2
First Law of Data Smog
Information, once rare and cherished like caviar, is now plentiful and taken for granted like potatoes.
Typical Quotation from Data Smog
Tens of thousands of words daily pulse through our beleaguered brains. Accompanied by a massive amount of other auditory and visual stimuli. In every moment of the audio-visual orgy of our highly-informed days, the brain handles a massive amount of electrical traffic. No wonder we feel burnt
– Philosopher Philip Novak
My Reaction
• Beginning of telephone era: people reported shock lasting days after a phone call
• Much opposition is simply romanticism
On the Other hand ...
• Serious looming problems in Data era;
• Typical critics ill-equipped to see, understand, or articulate them.
IV. The Long View – Mathematics
• Society is making massive bets ...
• Are these wise?
• Will the investment pay off?
• Where do you come in?
A Common View of Technology
(a) 2001
"Any sufficiently advanced technology is indistinguishable from Magic"
-- Arthur C. Clarke
(b) Quote
USNA Responsibility
• You are Leaders of Tomorrow
• You must be Tough Customers
• Demand Good Investments
• Make Society’s Investments Pay Off
Mathematics Tells You
• Information technology is not magic
• Extracting Information from data is not a sure thing
• Specific hard work on a case-by-case basis
• You can learn what must be done
Undergraduate/Graduate Courses Relevant
• Linear Algebra/Vector Analysis
• Statistics/Data Analysis
• Machine Learning/Data Mining
• Information Theory
• Signal Processing
Mathematical Viewpoint about Database
• Datasets == arrays of numbers
• Arrays of numbers == point clouds in space
• Exploitation of Data == Recognizing patterns in point clouds
Statistical Dataset
• N Individuals, p variables
• X(i, j) data, 1 ≤ i ≤ N, 1 ≤ j ≤ p
Example Mathematics Student Exams
• N = 25 Students, p = 5 measured variables
• X(i, 1): Diff. Geometry score of student i
• X(i, 2): Complex Analysis score of student i
• X(i, 3): Algebra score of student i
• X(i, 4): Real Analysis score of student i
• X(i, 5): Statistics score of student i
ID #   Diff. Geom.   Complex Analysis   Algebra   Real Analysis   Statistics
1      36            58                 43        36              37
2      62            54                 50        46              52
3      31            42                 41        40              29
4      76            78                 69        66              81
5      46            56                 52        56              40
6      12            42                 38        38              28
7      39            46                 51        54              41
. . .  . . .         . . .              . . .     . . .           . . .
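
The N-by-p layout can be written down directly as an array. The sketch below is only an illustration: it holds the seven rows shown in the table (not all N = 25 students), and the names `COLUMNS`, `X` and `column_means` are ours, not from the lecture. Note that the slides index X(i, j) from 1, while Python lists index from 0.

```python
# Exam-score dataset as an N-by-p array: one row per student, one column per exam.
# Only the seven rows shown on the slide are included; the full table has N = 25.
COLUMNS = ["Diff. Geometry", "Complex Analysis", "Algebra", "Real Analysis", "Statistics"]

X = [
    [36, 58, 43, 36, 37],
    [62, 54, 50, 46, 52],
    [31, 42, 41, 40, 29],
    [76, 78, 69, 66, 81],
    [46, 56, 52, 56, 40],
    [12, 42, 38, 38, 28],
    [39, 46, 51, 54, 41],
]

def column_means(data):
    """Average of each variable (column) over all individuals (rows)."""
    n = len(data)
    p = len(data[0])
    return [sum(row[j] for row in data) / n for j in range(p)]

N, p = len(X), len(X[0])          # number of individuals, number of variables
means = column_means(X)

print(f"N = {N} individuals, p = {p} variables")
for name, m in zip(COLUMNS, means):
    print(f"{name:>16}: mean {m:.1f}")
```

Each row of `X` is one point in 5-dimensional space, which is the "point cloud" view of a dataset.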
Dataset as a point cloud in 2-d
[Figure: scatter plot, Graduate Student Exam Scores. Horizontal axis: Differential Geometry Scores; vertical axis: Statistics Scores.]
Dataset as a point cloud in 3-d
[Figure: 3-d scatter plot, Graduate Student Exam Scores. Axes: Differential Geometry, Statistics, Algebra.]
Many other Statistical Datasets
• N Individuals, p variables
• X(i, j) data, 1 ≤ i ≤ N, 1 ≤ j ≤ p
Examples
• Medical: Individuals = Patients, Variables = Age, Blood Pressure, ...
• Financial: Individuals = Stocks, Variables = P/E Ratio, Earnings Growth (APR), ...
• Sports: Individuals = Athletes, Variables = AB, H, W, BA, RBI, HR, ...
Some Common Patterns in Point Clouds
• Planes
• Clusters
• Outliers
• Filaments
Data Analysis: Finding and Interpreting such Patterns
Interlude: MacSpin
Society’s Bets
• Socially useful datasets == point clouds in space
• Exploitation of Data == Recognizing patterns in point clouds
• We will recognize patterns in point clouds which allow exploitation
V. The True Challenge: High Dimensionality
Modern Data have high volumes
• A Sound: > 3 kHz
• An Image: > 1M Pixels
Each "Data Point" in a database is a highly multidimensional object
• An Utterance: > 50K dimensions
• A Face: > 200K dimensions
Example Modern Databases
Human Identity
• For each person, a 256 × 256 image
• N = 10^6 individuals (points)
• p = 256 × 256 = 65536 variables (dimensions)
Hyperspectral Image
• For each chemical, a 1024-long spectrum
• N = 5000 compounds (points)
• p = 1024 variables (dimensions)
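
To get a feel for these sizes, the arithmetic can be spelled out. This is a small sketch using only the N and p values quoted on the slide (reading the face database size as N = 10^6); the helper name `database_size` is ours.

```python
# Size of the two example databases, using the N and p values from the slide.
def database_size(n_points, n_dims):
    """Total number of stored values in an N-by-p data array."""
    return n_points * n_dims

faces_p = 256 * 256                          # pixels per image = dimensions per point
faces_total = database_size(10**6, faces_p)  # one million face images
spectra_total = database_size(5000, 1024)    # 5000 spectra of length 1024

print(faces_p)        # 65536
print(faces_total)    # 65536000000
print(spectra_total)  # 5120000
```

Every single point lives in a space with thousands of coordinates, so a 2-d or 3-d scatter plot can show only a tiny slice of it.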
Needs with High Dimensional data
• Visualizing high dimensional data
• Navigating high dimensional space
• Discovering high dimensional patterns
Can We Visualize High Dimensions?
We must Reduce Dimensionality!!
Idea: Project onto Subspace
Approaches to Reduce Dimensionality
• Ideas from Linear Algebra and Statistics
• Choose new coordinate system (basis)
• Rotate into that basis
• Use only a few of the new coordinates
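
The three steps (choose a basis, rotate into it, keep a few coordinates) can be sketched in a few lines. This is a minimal illustration under our own assumptions: the basis, the sample points and the function name `rotate_and_drop` are hypothetical, and the basis here is picked by hand rather than learned from data.

```python
import math

def dot(u, v):
    """Inner product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))

def rotate_and_drop(points, basis, keep):
    """Express each point in the new orthonormal basis, keeping the first `keep` coordinates."""
    return [[dot(x, b) for b in basis[:keep]] for x in points]

# Hypothetical orthonormal basis of 3-d space: a 45-degree rotation about the z-axis.
s = 1 / math.sqrt(2)
basis = [[s, s, 0.0], [-s, s, 0.0], [0.0, 0.0, 1.0]]

# Hypothetical points lying close to the line x = y, z = 0.
points = [[1.0, 1.1, 0.0], [2.0, 1.9, 0.1], [-3.0, -3.0, 0.0]]

# Three coordinates per point go in, one coordinate per point comes out.
reduced = rotate_and_drop(points, basis, keep=1)
print(reduced)
```

Because the points lie near the first new axis, the single retained coordinate still locates each point well; the two dropped coordinates are close to zero.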
Illustration of Rotate/Drop Coordinates
Approaches to Reduce Dimensionality
• V.I Principal Component Analysis
• V.II Multiresolution Analysis
Dataset as a point cloud in 2-d
[Figure: scatter plot, Graduate Student Exam Scores. Horizontal axis: Differential Geometry Scores; vertical axis: Statistics Scores.]
Concentration Ellipsoid and Principal Axes
[Figure: Graduate Student Exam Scores. Horizontal axis: Differential Geometry Scores; vertical axis: Statistics Scores.]
Principal Components
• Concentration Ellipsoid
• Principal Axes
• Longer Axes – more variation
Sometimes
• Few Axes – most variation
• Summarize many variables by a few.
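
For two variables, the principal axes can be computed by hand from the 2-by-2 covariance matrix. The sketch below is an illustration only: it uses the Differential Geometry and Statistics scores of the seven students listed earlier (not the full N = 25), and the function names are ours. The closed-form eigenvalue formula works only for the 2-by-2 case.

```python
import math

# Scores of the seven students shown in the table (two of the five variables).
diff_geom = [36, 62, 31, 76, 46, 12, 39]
stats = [37, 52, 29, 81, 40, 28, 41]

def covariance_2d(xs, ys):
    """Sample variances and covariance: entries of [[sxx, sxy], [sxy, syy]]."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    return sxx, sxy, syy

def principal_axes_2d(sxx, sxy, syy):
    """Eigenvalues (largest first) and unit eigenvectors of the covariance matrix."""
    mid = (sxx + syy) / 2
    rad = math.hypot((sxx - syy) / 2, sxy)
    lam1, lam2 = mid + rad, mid - rad
    if sxy != 0:
        v1 = (sxy, lam1 - sxx)            # solves (C - lam1 * I) v = 0
    else:
        v1 = (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    norm = math.hypot(v1[0], v1[1])
    v1 = (v1[0] / norm, v1[1] / norm)
    v2 = (-v1[1], v1[0])                  # perpendicular to v1
    return (lam1, lam2), (v1, v2)

sxx, sxy, syy = covariance_2d(diff_geom, stats)
(lam1, lam2), (v1, v2) = principal_axes_2d(sxx, sxy, syy)
fraction = lam1 / (lam1 + lam2)           # share of total variance along the long axis

print(f"long axis direction: ({v1[0]:.3f}, {v1[1]:.3f})")
print(f"share of variation on long axis: {fraction:.3f}")
```

On these seven rows the long axis carries most of the variation, which is the situation where one new coordinate can stand in for two original variables.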
PCA Relies on ...
• Linear Algebra (eigenvalues of positive definite matrices)
• Multivariate Statistics (covariance matrices)
• Numerical Computing (eigenvalue solvers)
Used throughout Science and Technology
• Chemometrics
• Hyperspectral Imagery
• Human Identification
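
To make the "eigenvalue solvers" ingredient concrete, here is a sketch of power iteration, one of the simplest such solvers. It is not claimed to be the method used in any of the applications above; production libraries use more refined algorithms. The 3-by-3 matrix `C` is a made-up positive definite example, not a covariance matrix from the lecture.

```python
import math

def mat_vec(a, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in a]

def power_iteration(a, iters=500, tol=1e-12):
    """Approximate the largest eigenvalue and its eigenvector for a symmetric PSD matrix."""
    n = len(a)
    v = [1.0 / math.sqrt(n)] * n          # unit-length starting vector
    lam = 0.0
    for _ in range(iters):
        w = mat_vec(a, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break                         # v is in the null space; cannot continue
        v = [x / norm for x in w]
        new_lam = sum(x * y for x, y in zip(v, mat_vec(a, v)))  # Rayleigh quotient
        converged = abs(new_lam - lam) < tol
        lam = new_lam
        if converged:
            break
    return lam, v

# Hypothetical symmetric positive definite matrix (eigenvalues 3 - sqrt(3), 3, 3 + sqrt(3)).
C = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]

lam, v = power_iteration(C)
print(f"largest eigenvalue ~ {lam:.6f}")
```

The eigenvector returned is the first principal axis when `C` is a covariance matrix; further axes need extra steps (for example deflation), which this sketch omits.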
PCA Output on eNose
PCA Output on Hyperspectral Imagery
Eigenfaces
(a) UBC Students (b) Principal faces
Can we discover ...
• ... from data
• ... the true coordinate system
• Recent approach: ICA
• Uses different statistical principles
Example: Facial Gestures
(a) PCA (b) ICA
Sejnowski Lab, Salk Inst.
V.II Dimensionality Reduction by Multiresolution Analysis
• New coordinate system: not pixels but (scale, location)
• Break Data into coarse/medium/fine scales
• Select information by scale/location
• Alter egos: wavelets, multiscale methods
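
The coarse/fine idea can be shown on a short 1-d signal with the Haar transform, the simplest wavelet. This is a sketch under our own assumptions: the signal values and function names are made up, and it does not implement the curvelets shown in the lecture's figures.

```python
def haar_step(signal):
    """Split an even-length signal into coarse averages and fine details."""
    coarse = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return coarse, detail

def haar_decompose(signal, levels):
    """Return the coarsest approximation and the detail lists, coarsest scale first."""
    details = []
    coarse = list(signal)
    for _ in range(levels):
        coarse, d = haar_step(coarse)
        details.insert(0, d)
    return coarse, details

def haar_reconstruct(coarse, details):
    """Invert haar_decompose; pass zeroed detail lists to drop a scale."""
    signal = list(coarse)
    for d in details:
        finer = []
        for c, dv in zip(signal, d):
            finer.extend([c + dv, c - dv])
        signal = finer
    return signal

# Hypothetical signal of length 8 (three dyadic scales).
signal = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
coarse, details = haar_decompose(signal, levels=3)

# Select information by scale: discard the finest-scale details only.
no_fine = details[:-1] + [[0.0] * len(details[-1])]
smooth = haar_reconstruct(coarse, no_fine)

print(coarse)   # [7.0]
print(smooth)   # [5.0, 5.0, 11.0, 11.0, 7.0, 7.0, 5.0, 5.0]
```

Each coefficient is labeled by a scale (which list it sits in) and a location (its position in that list), which is the new coordinate system the slide refers to.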
BigMac
Panels: Bandpass BigMac, s=1; Bandpass, s=2; Bandpass, s=3; Partitioned, s=2; Partitioned, s=3; Ridgelet Coeff, s=2; Ridgelet Coeff, s=2
Figure 25: BigMac Image, and stages of Curvelet Analysis.
Figure 26: Fingerprint Movie. 64W + 0/256/768/3072 Curvelets
Figure 27: Lichtenstein, ‘In The Car’; 64 Wavelets + 256 Curvelets
V.III Nonlinear Structures in High-D Space
(a) Scatter (b) Manifold
Example of Nonlinear Structure: Images
Can We Discover ...
• ... from data
• ... an underlying nonlinear coordinate system
Relevant Courses
• Linear Algebra
• Multivariate Statistics
• Differential Geometry of Curves and Surfaces
ISOMAP (Josh Tenenbaum, Stanford)
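
Isomap-style methods build on one idea: measure distance along the data rather than straight through space, by linking each point to its nearest neighbors and taking shortest paths in that graph. The sketch below illustrates only that first stage, not the full algorithm (which goes on to embed the points from these distances). The semicircle data and all function names are our own assumptions.

```python
import heapq
import math

def knn_graph(points, k):
    """Symmetric k-nearest-neighbor graph: {i: {j: euclidean distance}}."""
    graph = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        for d, j in dists[:k]:
            graph[i][j] = d
            graph[j][i] = d
    return graph

def shortest_paths(graph, source):
    """Dijkstra's algorithm: graph distance from `source` to every reachable node."""
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, i = heapq.heappop(heap)
        if d > best.get(i, math.inf):
            continue                      # stale heap entry
        for j, w in graph[i].items():
            nd = d + w
            if nd < best.get(j, math.inf):
                best[j] = nd
                heapq.heappush(heap, (nd, j))
    return best

# Hypothetical data: 21 points on a unit semicircle, a 1-d curve sitting in 2-d space.
points = [(math.cos(math.pi * i / 20), math.sin(math.pi * i / 20)) for i in range(21)]

graph = knn_graph(points, k=2)
along_curve = shortest_paths(graph, 0)[20]     # distance following the data
straight = math.dist(points[0], points[20])    # distance cutting through space

print(f"straight-line distance: {straight:.3f}")
print(f"graph distance:         {along_curve:.3f}")
```

The straight-line distance between the two ends is 2, while the graph distance is close to the arc length pi; the graph distance is the one that reflects the underlying nonlinear coordinate (the angle).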
Many Reasons to Navigate High Dimensions
• Locate Best Matches in Multimedia Databases
• Steganography
• Understand tactical tendencies of opponent
• Maneuver planning
Many Methods to Navigate High Dimensions
• Sonification
• Diverse Mathematical Ideas
Summary
• Entering Era of Data
• Serious Challenges: Data Smog, Data Glut
• Mathematical Challenges: navigate, explore high dimensional space
• Only scratched surface in this lecture
• Progress important for Navy Goals and Society in General
• You can contribute