

USNA Michelson Lecture 2001

Data, Data, Data!

Challenges and Opportunities of

the coming Data Deluge

David Donoho, Statistics Department, Stanford University


Thanks to ...

• Office of Naval Research:

– Reza Malek Madani

– Wen Masters

– Carey Schwartz

• Will Glaser, Savage Beast

• Jeffrey Klein, Xamplify

• John Lavery, Army Research Office

• Tracy McSheery, PhaseSpace

• Peter Weinberger, Renaissance Technologies

• Shankar Sastry, UC Berkeley


Agenda

I. The Era of Data

II. Its Characteristics

III. Its Critics

IV. Mathematics: The Long View

V. Mathematical Challenge: Navigating High Dimensional Space


I. The other World Hunger Problem

(a) Sherlock

"Data, Data, Data! I can't make bricks without clay!" -- The Adventure of the Copper Beeches

(b) Quotation


The coming Data Era

• Since beginning of history: unquenched desire for data

• Coming years, problem will be solved

• Prodigious efforts of best and brightest

• Massive constructions: ambitious and visionary

• Will define an Era

• Will permeate everything


The Era of Faith

(a) Sistine Chapel (b) Notre Dame


The coming Data Era

In place of cathedrals: massive investment for

• Ubiquitous sensors

• Pervasive networking

• Gargantuan databases

• Data-rich discourse and decision


The Data Era Touches Everything...

Touches even the most spiritual places

• Our Fate

• Our Creation


I.1 Data and Notions of Fate

(a) AB 3700 (b) Celera Genomics


Decoding the Genome

(a) Nature (b) Science


Genomic Data Availability

• 10^9 Base pairs

• 30,000 Genes

• On-line 24*7

• Ubiquitous Software

• Internet Accessibility


Genomic Data will be touching your fate

• Will you (your favorite phobia here ...)

– Will you die young?

– Will your offspring share your traits?

– Will you react poorly to stress?

• Will we ... conquer disease, defects?


I.2 Data and Notions of Creation


Sloan Digital Sky Survey

(a) Sloan Instrument (b) Las Campanas Survey


Sloan Facts (from www.sdss.org)

• Catalog larger by orders of magnitude than previous catalogs

• Data available on-line 24*7

• SDSS logo – "Cosmic Genome Project"


Data Touch our Idea of Creation

• Sensors seeing back into the earliest moments

• Processing detects the first wrinkles in space/time


Data Have Revolutionary Impact

• Today: one-time transition in human affairs

• You will be participants and shapers

Comparable Transition: Printing Press

• 50 million books in century 1450-1550

• Mass Communication, Art, Expression

• Exploration, Conquest, Empire

• Religious conflict, Total War


II. Symbols of the Modern Era?

(a) DataCenter (b) Network


True characteristics of the Modern Era

Ubiquitous, Pervasive Combination of

• Gathering Data

• Transporting

• Archiving

• Publishing

• Exploiting


II.1 Gathering Data

Everything Can and Will Become Data

• Smells

• Chemical Vision

• Motions, Gestures

• Preferences

• Essences

• Human Identity

• Behavior


Electronic Nose (N. Lewis, Caltech)


Electronic Nose Results

(a) (b)


Hyperspectral Photography

(a) Eye (b) Spectrum


Hyperspectral Photography – Derived Image

Figure 11: Derived Image


Human Motion Tracking – HiBall Tracker

(a) (b)


SAR/Lidar Sensor Fusion


New types of data: endless

• Cortical response to stimuli (fMRI)

• Internet Browsing Behavior

• Gait, Posture, Gesture

• NBA: mining videos for player moves


II.2 Ubiquitous and Pervasive Data Transport

Netcentric Communications

• Internet

• Wireless

• Bluetooth

Data Swamps Everything – Voice, Media etc.


II.3 Ubiquitous and Pervasive Archiving Data

• Earthquakes

• Galaxies

• Genome

• ...

• Massive

• Comprehensive

• Complete


Earthquake Catalogs


II.4 Ubiquitous and Pervasive Publication of Data

(a) Forbidden City (b) Detained EP3 on CNN


Publication of High-Resolution Data

(a) 1.0 Meter Res (b) 0.5 Meter Res


Publication of Financial Data


II.5 Pervasive and Ubiquitous Exploitation

• Scientific

• Military

• Industry

• Investment Community


Military need for Data

(a) (b)


Military Contributions to Data Era

• Computers

• Communications

• Networking

• Storage

• Display


Modernizing War

(a) Publ Docs (b) Army Docs


Navy Vision


Unmanned Air Vehicles


Unmanned Ground Sensors (Concept)


MicroMotes (Kris Pister, David Culler et al.)

(a) Mote (b) Network


US Corporations and Data Era

• Size of IT marketplace $750B/year

• Size of IT budgets nearing 50% of total.


Corporate Organization = Information Organization

• Merchandising

• Airlines

• Communications

• Shipping

• Manufacturing


WalMart: 7 Terabyte Transactions Database

(a) WalMart (b) Database


Investment Community and Data Era

Massive Investment in startups for ...

• Network Infrastructure

• Genomic Databases

• Consumer Behavior Tracking

• Mass Personalization


Example: Savage Beast (Oakland, CA)

(a) Query (b) Approach


Compiling Musical Genome


III Critics of the Data Era, 1

(a) Ralph Nader (b) Pope John


III Critics of the Data Era, 2


First Law of Data Smog

Information, once rare and cherished like caviar, is now plentiful and taken for granted like potatoes.


Typical Quotation from Data Smog

Tens of thousands of words daily pulse through our beleaguered brains. Accompanied by a massive amount of other auditory and visual stimuli. In every moment of the audio-visual orgy of our highly-informed days, the brain handles a massive amount of electrical traffic. No wonder we feel burnt

– Philosopher Philip Novak


My Reaction

• Beginning of telephone era: people reported shock lasting days after a phone call

• Much opposition is simply romanticism

On the Other hand ...

• Serious looming problems in Data era;

• Typical critics ill-equipped to see, understand, or articulate them.


IV. The Long View – Mathematics

• Society is making massive bets ...

• Are these wise?

• Will the investment pay off?

• Where do you come in?


A Common View of Technology

(a) 2001

"Any sufficiently advanced technology is indistinguishable from Magic"

-- Arthur C. Clarke

(b) Quote


USNA Responsibility

• You are Leaders of Tomorrow

• You must be Tough Customers

• Demand Good Investments

• Make Society’s Investments Pay Off

Mathematics Tells You

• Information technology is not magic

• Extracting Information from data is not a sure thing

• Specific hard work on a case-by-case basis

• You can learn what must be done


Undergraduate/Graduate Courses Relevant

• Linear Algebra/Vector Analysis

• Statistics/Data Analysis

• Machine Learning/Data Mining

• Information Theory

• Signal Processing


Mathematical Viewpoint about Database

• Datasets == arrays of numbers

• Arrays of numbers == point clouds in space

• Exploitation of Data == Recognizing patterns in point clouds


Statistical Dataset

• N Individuals, p variables

• X(i, j) data 1 ≤ i ≤ N , 1 ≤ j ≤ p

Example Mathematics Student Exams

• N = 25 Students, p = 5 measured variables

• X(i, 1): Diff. Geometry score of student i

• X(i, 2): Complex Analysis score of student i

• X(i, 3): Algebra score of student i

• X(i, 4): Real Analysis score of student i

• X(i, 5): Statistics score of student i


ID #   Diff. Geom.   Complex Analysis   Algebra   Real Analysis   Statistics
1      36            58                 43        36              37
2      62            54                 50        46              52
3      31            42                 41        40              29
4      76            78                 69        66              81
5      46            56                 52        56              40
6      12            42                 38        38              28
7      39            46                 51        54              41
...    ...           ...                ...       ...             ...
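
As a minimal sketch of the N-by-p layout described above, the seven rows visible on the slide can be stored as a list of rows in Python. The names `VARIABLES`, `X`, and `column_means` are illustrative assumptions, and the remaining 18 students are elided on the slide, so they are not filled in here.

```python
# Rows = individuals (students), columns = variables (exam scores).
# Only the seven rows shown on the slide are included; the rest are elided there.
VARIABLES = ["Diff. Geom.", "Complex Analysis", "Algebra", "Real Analysis", "Statistics"]

X = [
    [36, 58, 43, 36, 37],
    [62, 54, 50, 46, 52],
    [31, 42, 41, 40, 29],
    [76, 78, 69, 66, 81],
    [46, 56, 52, 56, 40],
    [12, 42, 38, 38, 28],
    [39, 46, 51, 54, 41],
]


def column_means(data):
    """Return the mean of each variable (column) over all individuals (rows)."""
    n = len(data)
    p = len(data[0])
    return [sum(row[j] for row in data) / n for j in range(p)]


N, p = len(X), len(X[0])
print(f"N = {N} individuals shown, p = {p} variables")

# The slide's X(i, j) is 1-indexed; Python lists are 0-indexed.
print("X(4, 5) =", X[3][4])  # Statistics score of student 4

for name, mean in zip(VARIABLES, column_means(X)):
    print(f"{name}: mean {mean:.1f}")
```

Each row of `X` is one point of the point cloud; picking two or three columns gives the 2-d and 3-d views on the following slides.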


Dataset as a point cloud in 2-d

[Figure: scatter plot "Graduate Student Exam Scores"; x-axis: Differential Geometry Scores; y-axis: Statistics Scores]

Dataset as a point cloud in 3-d

[Figure: 3-d scatter plot "Graduate Student Exam Scores"; axes: Differential Geometry, Algebra, Statistics]

Many other Statistical Datasets

• N Individuals, p variables

• X(i, j) data 1 ≤ i ≤ N , 1 ≤ j ≤ p

Examples

• Medical: Individuals = Patients, Variables = Age, Blood Pressure, ...

• Financial: Individuals = Stocks, Variables = P/E Ratio, Earnings Growth (APR), ...

• Sports: Individuals = Athletes, Variables = AB, H, W, BA, RBI, HR, ...


Some Common Patterns in Point Clouds

• Planes

• Clusters

• Outliers

• Filaments

Data Analysis: Finding and Interpreting such Patterns
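
One of the patterns listed above, outliers, can be illustrated with a simple heuristic: measure each point's distance from the centroid of the cloud and look at the farthest one. This is only a sketch of one possible approach, not a method named in the lecture. The function names are assumptions, and the data are the seven exam-score rows shown on the earlier slide.

```python
import math

# The seven exam-score rows visible on the earlier slide (one row per student).
X = [
    [36, 58, 43, 36, 37],
    [62, 54, 50, 46, 52],
    [31, 42, 41, 40, 29],
    [76, 78, 69, 66, 81],
    [46, 56, 52, 56, 40],
    [12, 42, 38, 38, 28],
    [39, 46, 51, 54, 41],
]


def centroid(points):
    """Coordinate-wise mean of a list of equal-length points."""
    n = len(points)
    return [sum(pt[j] for pt in points) / n for j in range(len(points[0]))]


def distance(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def farthest_from_centroid(points):
    """Return (index, distance) of the point farthest from the centroid."""
    c = centroid(points)
    dists = [distance(pt, c) for pt in points]
    idx = max(range(len(points)), key=lambda i: dists[i])
    return idx, dists[idx]


idx, dist = farthest_from_centroid(X)
print(f"Student {idx + 1} is farthest from the centroid (distance {dist:.1f})")
```

Being farthest from the centroid does not by itself make a point an outlier; deciding that takes a judgment about how spread out the rest of the cloud is.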


Interlude: MacSpin


Society’s Bets

• Socially useful datasets == point clouds in space

• Exploitation of Data == Recognizing patterns in point clouds

• We will recognize patterns in point clouds which allow exploitation


V. The True Challenge: High Dimensionality

Modern Data have high volumes

• A Sound: > 3KHz

• An Image: >1M Pixels

Each 'Data Point' in a database is a highly multidimensional object

• An Utterance: >50K dimensions

• A Face: > 200K dimensions


Example Modern Databases

Human Identity

• For each person, a 256 ∗ 256 image

• N = 10^6 individuals (points)

• p = 256 ∗ 256 = 65536 variables (dimensions)

Hyperspectral Image

• For each chemical, a 1024-long spectrum

• N = 5000 compounds (points)

• p = 1024 variables (dimensions)
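
The sizes quoted above can be turned into a quick back-of-the-envelope calculation. The sketch below reads the number of individuals in the Human Identity example as 10^6 and assumes one byte per stored value. Both the storage assumption and the helper name `table_size` are illustrative, not from the lecture.

```python
def table_size(n_points, n_dims, bytes_per_value=1):
    """Return (number of values, bytes) for an N-by-p data table."""
    values = n_points * n_dims
    return values, values * bytes_per_value


databases = {
    # N read as 10^6 individuals, p = 256 * 256 = 65536 pixels per image
    "Human Identity": (10**6, 256 * 256),
    # N = 5000 compounds, p = 1024 spectral values each
    "Hyperspectral": (5000, 1024),
}

for name, (n, p) in databases.items():
    values, size = table_size(n, p)  # assumption: 1 byte per value
    print(f"{name}: N={n}, p={p}, {values:,} values, "
          f"about {size / 1e9:.2f} GB at 1 byte each")
```

In both examples p is in the thousands or more, which is the sense in which each data point is a high dimensional object.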


Needs with High Dimensional data

• Visualizing high dimensional data

• Navigating high dimensional space

• Discovering high dimensional patterns


Can We Visualize High Dimensions?


We must Reduce Dimensionality!!

Idea: Project onto Subspace


Approaches to Reduce Dimensionality

• Ideas from Linear Algebra and Statistics

• Choose new coordinate system (basis)

• Rotate into that basis

• Use only a few of the new coordinates
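
The rotate-then-drop recipe above can be sketched in two dimensions: pick an orthonormal basis (here a rotation of the axes by a chosen angle), express each point in that basis, and keep only the first new coordinate. The angle and the sample points are hypothetical values chosen for illustration, and the function names are assumptions.

```python
import math


def rotate_into_basis(point, angle):
    """Coordinates of a 2-d point in the basis obtained by rotating the axes by `angle`."""
    x, y = point
    c, s = math.cos(angle), math.sin(angle)
    # New basis vectors: e1 = (c, s), e2 = (-s, c); new coordinates are dot products.
    return (x * c + y * s, -x * s + y * c)


def reduce_dimension(points, angle, keep=1):
    """Rotate every point into the new basis and keep the first `keep` coordinates."""
    return [rotate_into_basis(pt, angle)[:keep] for pt in points]


# Hypothetical points lying close to the line y = x.
points = [(1.0, 1.0), (2.0, 2.1), (3.0, 2.9), (-1.0, -1.0)]
angle = math.pi / 4  # first basis vector points along the line y = x

for pt, reduced in zip(points, reduce_dimension(points, angle)):
    print(pt, "->", tuple(round(v, 3) for v in reduced))
```

Because these points sit near a line, the dropped second coordinate is close to zero and little is lost. Choosing the basis so that this happens is the job of the methods on the next slides.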


Illustration of Rotate/Drop Coordinates

[Figure: 3-d plot; each axis runs from -20 to 20]

Approaches to Reduce Dimensionality

• V.I Principal Component Analysis

• V.II Multiresolution Analysis


Dataset as a point cloud in 2-d

[Figure: scatter plot "Graduate Student Exam Scores"; x-axis: Differential Geometry Scores; y-axis: Statistics Scores]

Concentration Ellipsoid and Principal Axes

[Figure: scatter plot "Graduate Student Exam Scores" with concentration ellipsoid and principal axes; x-axis: Differential Geometry Scores; y-axis: Statistics Scores]

Principal Components

• Concentration Ellipsoid

• Principal Axes

• Longer Axes – more variation

Sometimes

• Few Axes – most variation

• Summarize many variables by a few.
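
The idea of principal axes can be sketched for two variables using only the standard library. The example computes the sample covariance matrix of the Differential Geometry and Statistics scores for the seven students shown earlier, then uses the closed-form eigenvalues of a symmetric 2-by-2 matrix. The closed form is swapped in here to avoid a numerical eigenvalue solver, and the function names are assumptions.

```python
import math

# Differential Geometry and Statistics scores of the seven students shown earlier.
diff_geom = [36, 62, 31, 76, 46, 12, 39]
stats_scores = [37, 52, 29, 81, 40, 28, 41]


def covariance_2d(xs, ys):
    """Return entries (a, b, c) of the sample covariance matrix [[a, b], [b, c]]."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) ** 2 for x in xs) / (n - 1)
    c = sum((y - my) ** 2 for y in ys) / (n - 1)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    return a, b, c


def principal_axes_2d(a, b, c):
    """Eigenvalues (largest first) and unit first principal direction of [[a, b], [b, c]]."""
    mean = (a + c) / 2
    radius = math.hypot((a - c) / 2, b)
    lam1, lam2 = mean + radius, mean - radius
    if b != 0:
        vx, vy = b, lam1 - a  # eigenvector for lam1 when the matrix is not diagonal
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    return (lam1, lam2), (vx / norm, vy / norm)


(lam1, lam2), axis = principal_axes_2d(*covariance_2d(diff_geom, stats_scores))
fraction = lam1 / (lam1 + lam2)
print(f"first principal axis: ({axis[0]:.2f}, {axis[1]:.2f})")
print(f"share of total variance along it: {fraction:.0%}")
```

The longer axis of the concentration ellipsoid corresponds to the larger eigenvalue. When `fraction` is close to 1, the two scores can be summarized by a single coordinate along that axis.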


PCA Relies on ...

• Linear Algebra (eigenvalues of positive definite matrices)

• Multivariate Statistics (covariance matrices)

• Numerical Computing (eigenvalue solvers)

Used throughout Science and Technology

• Chemometrics

• Hyperspectral Imagery

• Human Identification


PCA Output on eNose


PCA Output on Hyperspectral Imagery


Eigenfaces

(a) UBC Students (b) Principal faces


Can we discover ...

• ... from data

• ... the true coordinate system

• Recent approach: ICA

• Uses different statistical principles


Example: Facial Gestures


(a) PCA (b) ICA

Sejnowski Lab, Salk Inst.


V.II Dimensionality Reduction by Multiresolution Analysis

• New coordinate system: not pixels but (scale, location)

• Break Data into coarse/medium/fine scales

• Select information by scale/location

• Alter egos: wavelets, multiscale methods
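
The coarse/medium/fine decomposition above can be sketched with the Haar transform, the simplest wavelet. It is swapped in here purely for illustration, since the slides mention wavelets and curvelets without naming a specific transform. The sample signal and function names are hypothetical.

```python
def haar_step(signal):
    """One level of an unnormalized Haar transform on an even-length sequence."""
    coarse = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return coarse, detail


def haar_inverse(coarse, detail):
    """Undo haar_step: rebuild the signal from pairwise averages and differences."""
    out = []
    for avg, diff in zip(coarse, detail):
        out.extend([avg + diff, avg - diff])
    return out


def multiresolution(signal):
    """Apply haar_step repeatedly to the coarse part; length must be a power of two."""
    details = []
    coarse = list(signal)
    while len(coarse) > 1:
        coarse, detail = haar_step(coarse)
        details.append(detail)  # finest scale first, coarsest last
    return coarse, details


signal = [4, 6, 10, 12, 8, 8, 0, 2]  # hypothetical samples
coarse, details = multiresolution(signal)
print("overall average:", coarse)
for level, detail in enumerate(details, start=1):
    print(f"detail coefficients at level {level} (1 = finest):", detail)
```

Each coefficient is indexed by a scale (the level) and a location (its position in the list). Keeping only some scales or locations is one way to select information, as the bullets above describe.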


Figure 25: BigMac Image, and stages of Curvelet Analysis.
[Panels: BigMac; Bandpass BigMac, s=1; Bandpass, s=2; Bandpass, s=3; Partitioned, s=2; Partitioned, s=3; Ridgelet Coeff, s=2; Ridgelet Coeff, s=2]

Figure 26: Fingerprint Movie. 64W + 0/256/768/3072 Curvelets

Figure 27: Lichtenstein, ‘In The Car’; 64 Wavelets + 256 Curvelets


V.III Nonlinear Structures in High-D Space

(a) Scatter (b) Manifold


Example of Nonlinear Structure: Images


Can We Discover ...

• ... from data

• ... an underlying nonlinear coordinate system

Relevant Courses

• Linear Algebra

• Multivariate Statistics

• Differential Geometry of Curves and Surfaces


ISOMAP (Josh Tenenbaum, Stanford)


Many Reasons to Navigate High Dimensions

• Locate Best Matches in Multimedia Databases

• Steganography

• Understand tactical tendencies of opponent

• Maneuver planning

Many Methods to Navigate High Dimensions

• Sonification

• Diverse Mathematical Ideas
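
The first item above, locating best matches in a database, can be sketched as a brute-force nearest-neighbor search over feature vectors. This is a baseline illustration only and not a method from the lecture. The three-dimensional vectors and the name `nearest_neighbor` are hypothetical; real multimedia items would have thousands of dimensions.

```python
import math


def nearest_neighbor(query, database):
    """Return (index, distance) of the database vector closest to `query` (brute force)."""
    best_index, best_dist = -1, math.inf
    for i, vec in enumerate(database):
        d = math.dist(query, vec)  # Euclidean distance (Python 3.8+)
        if d < best_dist:
            best_index, best_dist = i, d
    return best_index, best_dist


# Hypothetical feature vectors, one per database item.
database = [
    [0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0],
    [5.0, 5.0, 5.0],
]
query = [0.9, 1.2, 1.0]

index, dist = nearest_neighbor(query, database)
print(f"best match: item {index} at distance {dist:.3f}")
```

The cost of this search grows with both the number of items N and the dimension p, which is one reason reducing dimensionality first can matter.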


Summary

• Entering Era of Data

• Serious Challenges: Data Smog, Data Glut

• Mathematical Challenges: navigate, explore high dimensional space

• Only scratched surface in this lecture

• Progress important for Navy Goals and Society in General

• You can contribute