Download - the coming Data Deluge Challenges and Opportunities of ...lekheng/courses/191f09/donoho.pdf · 6 individuals (points) • p = 256 ∗ 256 = 65536 variables (dimensions) Hyperspectral

Transcript
Page 1: the coming Data Deluge Challenges and Opportunities of ...lekheng/courses/191f09/donoho.pdf · 6 individuals (points) • p = 256 ∗ 256 = 65536 variables (dimensions) Hyperspectral

Data, Data, Data! 1�

USNA Michelson Lecture 2001

Data, Data, Data!

Challenges and Opportunities of

the coming Data Deluge

David Donoho, Statistics Department, Stanford University

Data, Data, Data! 2�

Thanks to ...

• Office of Naval Research:

– Reza Malek Madani

– Wen Masters

– Carey Schwartz

• Will Glaser, Savage Beast

• Jeffrey Klein, Xamplify

• John Lavery, Army Research Office

• Tracy McSheery, PhaseSpace

• Peter Weinberger, Renaissance Technologies

• Shankar Sastry, UC Berkeley

Data, Data, Data! 3�

Data, Data, Data! 4�

Agenda

I. The Era of Data

II. Its Characteristics

III. Its Critics

IV. Mathematics: The Long View

V. Mathematical Challenge: Navigating High Dimensional Space

Page 2: the coming Data Deluge Challenges and Opportunities of ...lekheng/courses/191f09/donoho.pdf · 6 individuals (points) • p = 256 ∗ 256 = 65536 variables (dimensions) Hyperspectral

Data, Data, Data! 5�

I. The other World Hunger Problem

(a) Sherlock

"Data, Data, Data!I can't make bricks without clay!"-- The Adventure of the Copper Beeches

(b) Quotation

Data, Data, Data! 6�

The coming Data Era

• Since beginning of history: unquenched desire for data

• Coming years, problem will be solved

• Prodigious efforts of best and brightest

• Massive constructions: ambitious and visionary

• Will define an Era

• Will permeate everything

Data, Data, Data! 7�

The Era of Faith

(a) Sistine Chapel (b) Notre Dame

Data, Data, Data! 8�

The coming Data Era

In place of cathedrals: massive investment for

• Ubiquitous sensors

• Pervasive networking

• Gargantuan databases

• Data-rich discourse and decision

Page 3: the coming Data Deluge Challenges and Opportunities of ...lekheng/courses/191f09/donoho.pdf · 6 individuals (points) • p = 256 ∗ 256 = 65536 variables (dimensions) Hyperspectral

Data, Data, Data! 9�

The Data Era Touches Everything...

Touches even the most spiritual places

• Our Fate

• Our Creation

Data, Data, Data! 10�

I.1 Data and Notions of Fate

(a) AB 3700 (b) Celera Genomics

Data, Data, Data! 11�

Decoding the Genome

(a) Nature (b) Science

Data, Data, Data! 12�

Genomic Data Availability

• 109 Base pairs

• 30,000 Genes

• On-line 24*7

• Ubiquitous Software

• Internet Accessibility

Page 4: the coming Data Deluge Challenges and Opportunities of ...lekheng/courses/191f09/donoho.pdf · 6 individuals (points) • p = 256 ∗ 256 = 65536 variables (dimensions) Hyperspectral

Data, Data, Data! 13�

Genomic Data will be touching your fate

• Will you (your favorite phobia here ...)

– Will you die young?

– Will your offspring share your traits?

– Will you react poorly to stress?

• Will we ... conquer disease, defects?

Data, Data, Data! 14�

I.2 Data and Notions of Creation

Data, Data, Data! 15�

Sloan Digital Sky Survey

(a) Sloan Instrument (b) Las Campanas Survey

Data, Data, Data! 16�

Sloan Facts (from www.sdss.org)

• Catalog larger by orders of magnitude tha perviou

• Data available on-line 24*7

• SDSS logo – ”Cosmic Genome Project”

Page 5: the coming Data Deluge Challenges and Opportunities of ...lekheng/courses/191f09/donoho.pdf · 6 individuals (points) • p = 256 ∗ 256 = 65536 variables (dimensions) Hyperspectral

Data, Data, Data! 17�

Data Touch our Idea of Creation

• Sensors seeing back into the earliest moments

• Processing detects the first wrinkles in space/time

Data, Data, Data! 18�

Data Have Revolutionary Impact

• Today: one-time transition in human affairs

• You will be participants and shapers

Comparable Transition: Printing Press

• 50 million books in century 1450-1550

• Mass Communication, Art, Expression

• Exploration, Conquest, Empire

• Religious conflict, Total War

Data, Data, Data! 19�

II. Symbols of the Modern Era?

(a) DataCenter (b) Network

Data, Data, Data! 20�

True characteristics of the Modern Era

Ubiquitous, Pervasive Combination of

• Gathering Data

• Transporting

• Archiving

• Publishing

• Exploiting

Page 6: the coming Data Deluge Challenges and Opportunities of ...lekheng/courses/191f09/donoho.pdf · 6 individuals (points) • p = 256 ∗ 256 = 65536 variables (dimensions) Hyperspectral

Data, Data, Data! 21�

II.1 Gathering Data

Everything Can and Will Become Data

• Smells

• Chemical Vision

• Motions, Gestures

• Preferences

• Essences

• Human Identity

• Behavior

Data, Data, Data! 22�

Electronic Nose (N. Lewis, Caltech)

Data, Data, Data! 23�

Electronic Nose Results

(a) (b)

Data, Data, Data! 24�

Hyperspectral Photography

(a) Eye (b) Spectrum

Page 7: the coming Data Deluge Challenges and Opportunities of ...lekheng/courses/191f09/donoho.pdf · 6 individuals (points) • p = 256 ∗ 256 = 65536 variables (dimensions) Hyperspectral

Data, Data, Data! 25�

Hyperspectral Photography – Derived Image

Figure 11: Derived Image

Data, Data, Data! 26�

Human Motion Tracking – HiBall Tracker

(a) (b)

Data, Data, Data! 27�

SAR/Lidar Sensor Fusion

Data, Data, Data! 28�

New types of data: endless

• Cortical response to stimuli (fMRI)

• Internet Browsing Behavior

• Gait, Posture, Gesture

• NBA: mining videos for player moves

Page 8: the coming Data Deluge Challenges and Opportunities of ...lekheng/courses/191f09/donoho.pdf · 6 individuals (points) • p = 256 ∗ 256 = 65536 variables (dimensions) Hyperspectral

Data, Data, Data! 29�

II.2 Ubiquitous and Pervasive Data Transport

Netcentric Communications

• Internet

• Wireless

• Bluetooth

Data Swamps Everything – Voice, Media etc.

Data, Data, Data! 30�

II.3 Ubiquitous and Pervasive Archiving Data

• Earthquakes

• Galaxies

• Genome

• ...

• Massive

• Comprehensive

• Complete

Data, Data, Data! 31�

Earthquake Catalogs

Data, Data, Data! 32�

II.4 Ubiquitous and Pervasive Publication of Data

(a) Forbidden City (b) Detained EP3 on CNN

Page 9: the coming Data Deluge Challenges and Opportunities of ...lekheng/courses/191f09/donoho.pdf · 6 individuals (points) • p = 256 ∗ 256 = 65536 variables (dimensions) Hyperspectral

Data, Data, Data! 33�

Publication of High-Resolution Data

(a) 1.0 Meter Res (b) 0.5 Meter Res

Data, Data, Data! 34�

Publication of Financial Data

Data, Data, Data! 35�

II.5 Pervasive and Ubiquitous Exploitation

• Scientific

• Military

• Industry

• Investment Community

Data, Data, Data! 36�

Military need for Data

(a) (b)

Page 10: the coming Data Deluge Challenges and Opportunities of ...lekheng/courses/191f09/donoho.pdf · 6 individuals (points) • p = 256 ∗ 256 = 65536 variables (dimensions) Hyperspectral

Data, Data, Data! 37�

Military Contributions to Data Era

• Computers

• Communications

• Networking

• Storage

• Display

Data, Data, Data! 38�

Modernizing War

(a) Publ Docs (b) Army Docs

Data, Data, Data! 39�

Navy Vision

Data, Data, Data! 40�

Unmanned Air Vehicles

Page 11: the coming Data Deluge Challenges and Opportunities of ...lekheng/courses/191f09/donoho.pdf · 6 individuals (points) • p = 256 ∗ 256 = 65536 variables (dimensions) Hyperspectral

Data, Data, Data! 41�

Unmanned Ground Sensors (Concept)

Data, Data, Data! 42�

MicroMotes (Kris Pister, David Culler et al.)

(a) Mote (b) Network

Data, Data, Data! 43�

US Corporations and Data Era

• Size of IT marketplace $750B/year

• Size of IT budgets nearing 50% of total.

Data, Data, Data! 44�

Corporate Organization = Information Organization

• Merchandising

• Airlines

• Communications

• Shipping

• Manufacturing

Page 12: the coming Data Deluge Challenges and Opportunities of ...lekheng/courses/191f09/donoho.pdf · 6 individuals (points) • p = 256 ∗ 256 = 65536 variables (dimensions) Hyperspectral

Data, Data, Data! 45�

WalMart: 7 Terabyte Transactions Database

(a) WalMart (b) Database

Data, Data, Data! 46�

Investment Community and Data Era

Massive Investment in startups for ...

• Network Infrastructure

• Genomic Databases

• Consumer Behavior Tracking

• Mass Personalization

Data, Data, Data! 47�

Example: Savage Beast (Oakland, CA)

(a) Query (b) Approach

Data, Data, Data! 48�

Compiling Musical Genome

Page 13: the coming Data Deluge Challenges and Opportunities of ...lekheng/courses/191f09/donoho.pdf · 6 individuals (points) • p = 256 ∗ 256 = 65536 variables (dimensions) Hyperspectral

Data, Data, Data! 49�

Data, Data, Data! 50�

III Critics of the Data Era, 1

(a) Ralph Nader (b) Pope John

Data, Data, Data! 51�

III Critics of the Data Era, 2

Data, Data, Data! 52�

First Law of Data Smog

Information, once rare and cherished like caviar, is nowplentiful and take for granted like potatoes.

Page 14: the coming Data Deluge Challenges and Opportunities of ...lekheng/courses/191f09/donoho.pdf · 6 individuals (points) • p = 256 ∗ 256 = 65536 variables (dimensions) Hyperspectral

Data, Data, Data! 53�

Typical Quotation from Data Smog

Tens of thousands of words daily pulse through ourbeleaguered brains. Accompanied by a massive amount ofother auditory and visual stimuli. In every moment of theaudio-visual orgy of our highly-informed days, the brainhandles a massive amount of electrical traffic. No wonderwe feel burnt

– Philosopher Philip Novak

Data, Data, Data! 54�

My Reaction

• Beginning of telephone era: people reported shock lasting daysafter a phone call

• Much opposition is simply romanticism

On the Other hand ...

• Serious looming problems in Data era;

• Typical critics ill-equipped to see, understand, or articulatethem.

Data, Data, Data! 55�

IV. The Long View – Mathematics

• Society is making massive bets ...

• Are these wise?

• Will the investment pay off?

• Where do you come in?

Data, Data, Data! 56�

A Common View of Technology

(a) 2001

"Any sufficiently advanced technology is indistinguishable from Magic"

-- Arthur C. Clarke

(b) Quote

Page 15: the coming Data Deluge Challenges and Opportunities of ...lekheng/courses/191f09/donoho.pdf · 6 individuals (points) • p = 256 ∗ 256 = 65536 variables (dimensions) Hyperspectral

Data, Data, Data! 57�

USNA Responsibility

• You are Leaders of Tomorrow

• You must be Tough Customers

• Demand Good Investments

• Make Society’s Investments Pay Off

Mathematics Tells You

• Information technology is not magic

• Extracting Information from data is not a sure thing

• Specific hard work on a case-by-case basis

• You can learn what must be done

Data, Data, Data! 58�

Undergraduate/Graduate Courses Relevant

• Linear Algebra/Vector Analysis

• Statistics/Data Analysis

• Machine Learning/Data Mining

• Information Theory

• Signal Processing

Data, Data, Data! 59�

Mathematical Viewpoint about Database

• Datasets == arrays of numbers

• Array of numbers == points clouds in space

• Exploitation of Data == Recognizing patterns in point clouds

Data, Data, Data! 60�

Statistical Dataset

• N Individuals, p variables

• X(i, j) data 1 ≤ i ≤ N , 1 ≤ j ≤ p

Example Mathematics Student Exams

• N = 25 Students, p = 5 measured variables

• X(i, 1): Diff. Geometry score of student i

• X(i, 2): Complex Analysis score of student i

• X(i, 3): Algebra score of student i

• X(i, 4): Real Analysis score of student i

• X(i, 5): Statistics score of student i

Page 16: the coming Data Deluge Challenges and Opportunities of ...lekheng/courses/191f09/donoho.pdf · 6 individuals (points) • p = 256 ∗ 256 = 65536 variables (dimensions) Hyperspectral

Data, Data, Data! 61�

ID # Diff. Complex Algebra Real Statistics

Geom. Analysis Analysis

1 36 58 43 36 37

2 62 54 50 46 52

3 31 42 41 40 29

4 76 78 69 66 81

5 46 56 52 56 40

6 12 42 38 38 28

7 39 46 51 54 41

. . . . . . . . . . . . . . . . . .

Data, Data, Data! 62�

Dataset as a point cloud in 2-d

0 10 20 30 40 50 60 70 8010

20

30

40

50

60

70

80

90

Differential Geometry Scores

Sta

tistic

s S

core

s

Graduate Student Exam Scores

Data, Data, Data! 63�

Dataset as a point cloud in 3-d

0

20

40

60

80 0

20

40

60

80

10030

35

40

45

50

55

60

65

70

Statistics

Graduate Student Exam Scores

Differential Geometry

Alg

ebra

Data, Data, Data! 64�

Many other Statistical Datasets

• N Individuals, p variables

• X(i, j) data 1 ≤ i ≤ N , 1 ≤ j ≤ p

Examples

• Medical: Individuals = Patients, Variables = Age, BloodPressure, ...

• Financial: Individuals = Stocks, Variables = P/E Ratio,Earnings Growth (APR), ...

• Sports: Individuals = Athletes, Variables = AB, H, W, BA,RBI, HR, ...

Page 17: the coming Data Deluge Challenges and Opportunities of ...lekheng/courses/191f09/donoho.pdf · 6 individuals (points) • p = 256 ∗ 256 = 65536 variables (dimensions) Hyperspectral

Data, Data, Data! 65�

Some Common Patterns in Point Clouds

• Planes

• Clusters

• Outliers

• Filaments

Data Analysis: Finding and Interpreting such Patterns

Data, Data, Data! 66�

Interlude: MacSpin

Data, Data, Data! 67�

Society’s Bets

• Socially useful datasets == point clouds in space

• Exploitation of Data == Recognizing patterns in point clouds

• We will recognize patterns in point clouds which allowexploitation

Data, Data, Data! 68�

V. The True Challenge: High Dimensionality

Modern Data have high volumes

• A Sound: > 3KHz

• An Image: >1M Pixels

Each ’Data Point’ in a database is a highly multidimensional object

• An Utterance: >50K dimensions

• A Face: > 200K dimensions

Page 18: the coming Data Deluge Challenges and Opportunities of ...lekheng/courses/191f09/donoho.pdf · 6 individuals (points) • p = 256 ∗ 256 = 65536 variables (dimensions) Hyperspectral

Data, Data, Data! 69�

Example Modern Databases

Human Identity

• For each person, a 256 ∗ 256 image

• N = 106 individuals (points)

• p = 256 ∗ 256 = 65536 variables (dimensions)

Hyperspectral Image

• For each chemical, a 1024-long spectrum

• N = 5000 compounds (points)

• p = 1024 variables (dimensions)

Data, Data, Data! 70�

Needs with High Dimensional data

• Visualizing high dimensional data

• Navigating high dimensional space

• Discovering high dimensional patterns

Data, Data, Data! 71�

Can We Visualize High Dimensions?

Data, Data, Data! 72�

Page 19: the coming Data Deluge Challenges and Opportunities of ...lekheng/courses/191f09/donoho.pdf · 6 individuals (points) • p = 256 ∗ 256 = 65536 variables (dimensions) Hyperspectral

Data, Data, Data! 73�

Data, Data, Data! 74�

�Data, Data, Data! 75�

Data, Data, Data! 76�

Page 20: the coming Data Deluge Challenges and Opportunities of ...lekheng/courses/191f09/donoho.pdf · 6 individuals (points) • p = 256 ∗ 256 = 65536 variables (dimensions) Hyperspectral

Data, Data, Data! 77�

Data, Data, Data! 78�

�Data, Data, Data! 79�

Data, Data, Data! 80�

Page 21: the coming Data Deluge Challenges and Opportunities of ...lekheng/courses/191f09/donoho.pdf · 6 individuals (points) • p = 256 ∗ 256 = 65536 variables (dimensions) Hyperspectral

Data, Data, Data! 81�

Data, Data, Data! 82�

We must Reduce Dimensionality!!

Idea: Project onto Subspace

Data, Data, Data! 83�

Approaches to Reduce Dimensionality

• Ideas from Linear Algebra and Statistics

• Choose new coordinate system (basis)

• Rotate into that basis

• Use only a few of the new coordinates

Data, Data, Data! 84�

Illustration of Rotate/Drop Coordinates

-20

-10

0

10

20

-20

-10

0

10

20-20

-10

0

10

20

Page 22: the coming Data Deluge Challenges and Opportunities of ...lekheng/courses/191f09/donoho.pdf · 6 individuals (points) • p = 256 ∗ 256 = 65536 variables (dimensions) Hyperspectral

Data, Data, Data! 85�

Approaches to Reduce Dimensionality

• V.I Principal Component Analysis

• V.II Multiresolution Analysis

Data, Data, Data! 86�

Dataset as a point cloud in 2-d

0 10 20 30 40 50 60 70 8010

20

30

40

50

60

70

80

90

Differential Geometry Scores

Sta

tistic

s S

core

s

Graduate Student Exam Scores

Data, Data, Data! 87�

Concentration Ellipsoid and Principal Axes

-20 0 20 40 60 80 1000

10

20

30

40

50

60

70

80

90

Differential Geometry Scores

Sta

tistic

s S

core

s

Graduate Student Exam Scores

Data, Data, Data! 88�

Principal Components

• Concentration Ellipsoid

• Principal Axes

• Longer Axes – more variation

Sometimes

• Few Axes – most variation

• Summarize many variables by a few.

Page 23: the coming Data Deluge Challenges and Opportunities of ...lekheng/courses/191f09/donoho.pdf · 6 individuals (points) • p = 256 ∗ 256 = 65536 variables (dimensions) Hyperspectral

Data, Data, Data! 89�

PCA Relies on ...

• Linear Algebra (eigenvalues of positive definite matrices)

• Multivariate Statistics (covariance matrices)

• Numerical Computing (eigenvalue solvers)

Used throughout Science and Technology

• Chemometrics

• Hyperspectral Imagery

• Human Identification

Data, Data, Data! 90�

PCA Output on eNose

Data, Data, Data! 91�

PCA Output on Hyperspectral Imagery

Data, Data, Data! 92�

Eigenfaces

(a) UBC Students (b) Principal faces

Page 24: the coming Data Deluge Challenges and Opportunities of ...lekheng/courses/191f09/donoho.pdf · 6 individuals (points) • p = 256 ∗ 256 = 65536 variables (dimensions) Hyperspectral

Data, Data, Data! 93�

Can we discover ...

• ... from data

• ... the true cooordinate system

• Recent approach: ICA

• Uses different statistical principles

Data, Data, Data! 94�

Example: Facial Gestures

Data, Data, Data! 95�

(a) PCA (b) ICA

Sejnowski Lab, Salk Inst.

Data, Data, Data! 96�

V.II Dimensionality Reduction by Multiresolution Anal

• New coordinate system: not pixels but (scale,location)

• Break Data into coarse/medium/fine scales

• Select information by scale/location

• Alter egos: wavelets, multiscale methods

Page 25: the coming Data Deluge Challenges and Opportunities of ...lekheng/courses/191f09/donoho.pdf · 6 individuals (points) • p = 256 ∗ 256 = 65536 variables (dimensions) Hyperspectral

Data, Data, Data! 97�

BigMac

Bandpass BigMac, s=1

Bandpass, s=2

Bandpass, s=3

Partitioned, s=2

Partitioned, s=3 Ridgelet Coeff, s=2

Ridgelet Coeff, s=2

Figure 25: BigMac Image, and stages of Curvelet Analysis.

Data, Data, Data! 98�

�Figure 26: Fingerprint Movie. 64W + 0/256/768/3072 Curvelets

Data, Data, Data! 99�

Figure 27: Lichtenstein, ‘In The Car’; 64 Wavelets+ 256 Curvelets

Data, Data, Data! 100�

V.III Nonlinear Structures in High-D Space

(a) Scatter (b) Manifold

Page 26: the coming Data Deluge Challenges and Opportunities of ...lekheng/courses/191f09/donoho.pdf · 6 individuals (points) • p = 256 ∗ 256 = 65536 variables (dimensions) Hyperspectral

Data, Data, Data! 101�

Example of Nonlinear Structure: Images

Data, Data, Data! 102�

Can We Discover ...

• ... from data

• ... an underlying nonlinear coordinate system

Relevant Courses

• Linear Algebra

• Multivariate Statistics

• Differential Geometry of Curves and Surfaces

Data, Data, Data! 103�

ISOMAP (Josh Tenenbaum, Stanford)

Data, Data, Data! 104�

Many Reasons to Navigate High Dimensions

• Locate Best Matches in Multimedia Databases

• Steganography

• Understand tactical tendencies of opponent

• Maneuver planning

Many Methods to Navigate High Dimensions

• Sonification

• Diverse Mathematical Ideas

Page 27: the coming Data Deluge Challenges and Opportunities of ...lekheng/courses/191f09/donoho.pdf · 6 individuals (points) • p = 256 ∗ 256 = 65536 variables (dimensions) Hyperspectral

Data, Data, Data! 105�

Summary

• Entering Era of Data

• Serious Challenges: Data Smog, Data Glut

• Mathematical Challenges:navigate, explore high dimensional space

• Only scratched surface in this lecture

• Progress important for Navy Goals and Society in General

• You can contribute