What to Do? A Research Agenda


Page 1: What to Do?  A Research Agenda

What to Do? A Research Agenda

Jim Gray

Microsoft Research

Page 2: What to Do?  A Research Agenda

“Everything that can be invented has been invented.” Commissioner, U.S. Office of Patents, 1899

From http://inventors.about.com/library/lessons/bl_appendix5.htm

…absolutely no basis to support Duell's alleged statement. Just the opposite is true. Duell's 1899 report documents an increase of about 3,000 patents over the previous year, and nearly 60 times the number granted in 1837. Further, Duell quotes President McKinley's annual message saying, "Our future progress and prosperity depend upon our ability to equal, if not surpass, other nations in the enlargement and advance of science, industry and commerce. To invention we must turn as one of the most powerful aids to the accomplishment of such a result." Duell adds, "May not our inventors hopefully look to the Fifty-sixth Congress for aid and effectual encouragement in improving the American patent system?" These are unlikely words of someone who thinks that everything has been invented.

Page 3: What to Do?  A Research Agenda

“We have patents on the Byte and the Algorithm”

Dave Huffman

• EE has the electron
• Physics has matter/energy
• Chemistry has molecules and reactions
• Economics has the transaction.

• They all need us to do anything

• The IT revolution is just starting.

Page 4: What to Do?  A Research Agenda

Science, The Endless Frontier
Vannevar Bush -> Harry Truman, July 1945

1. Introduction: Scientific Progress is Essential; Science is a Proper Concern of Government; Government Relations to Science - Past and Future; Freedom of Inquiry Must be Preserved

2. The War Against Disease: In War; In Peace; Unsolved Problems; Broad and Basic Studies Needed; Coordinated Attack on Special Problems; Action is Necessary

3. Science and the Public Welfare: Relation to National Security; Science and Jobs; The Importance of Basic Research; Centers of Basic Research; Research Within the Government; Industrial Research; International Exchange of Scientific Information; The Special Need for Federal Support; The Cost of a Program

4. Renewal of our Scientific Talent: Nature of the Problem; A Note of Warning; The Wartime Deficit; Improve the Quality; Remove the Barriers; The Generation in Uniform Must Not be Lost; A Program

5. A Problem of Scientific Reconversion: Effects of Mobilization of Science for War; Security Restrictions Should be Lifted Promptly; Need for Coordination; A Board to Control Release; Publication Should be Encouraged

6. The Means to the End: New Responsibilities for Government; The Mechanism; Five Fundamentals; Military Research; National Research Foundation

http://www.nsf.gov/od/lpa/nsf50/vbush1945.htm

Page 5: What to Do?  A Research Agenda

Request From George Djorgovski

• What do you think we (CACR) should do?
– Long term
– 2 or 3 focus areas
– Leverage our skills.

Page 6: What to Do?  A Research Agenda

Honest Answer

• I do not know.

• Hire smart over-achievers

• Give them (barely) enough resources

• Ask them what they have accomplished

• Foster mutual respect

• Take credit for the successes

• Allow some of them to fail.

Page 7: What to Do?  A Research Agenda

Talking the Talk and Walking the Walk

• We have 7 people – what’s our agenda?

• Observe it is all about:
Data → Information → Knowledge → Wisdom
People == Communication is the “killer app”

• ½ Personal Information Management

• ½ Corporate Information Management

Page 8: What to Do?  A Research Agenda

Memex
As We May Think, Vannevar Bush, 1945

“A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility”

“yet if the user inserted 5000 pages of material a day it would take him hundreds of years to fill the repository, so that he can be profligate and enter material freely”

Page 9: What to Do?  A Research Agenda

25K-day life ~ Personal Petabyte

[Chart: lifetime storage by media type on a log scale from 0.001 TB to 1000 TB. Categories: Msgs, web pages, Tifs, Books, jpegs, 1 KBps sound, music, Videos. Lifetime storage: 1 PB.]

Will anyone look at web pages in 2020? Probably new modalities & media will dominate then.
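The "25K-day life ~ personal petabyte" estimate can be turned into a daily capture budget with a little arithmetic. The sketch below assumes decimal units (1 PB = 10^15 bytes) and an even spread of capture over the lifetime; both are assumptions made for illustration, not figures from the slide.

```python
# Back-of-envelope budget for a "personal petabyte" over a 25,000-day life.
# Assumptions: 1 PB = 10**15 bytes, data captured evenly every day.
DAYS_IN_LIFE = 25_000
LIFETIME_BYTES = 10**15
SECONDS_PER_DAY = 86_400


def daily_budget_bytes(lifetime_bytes=LIFETIME_BYTES, days=DAYS_IN_LIFE):
    """Bytes that can be stored per day without exceeding the lifetime total."""
    return lifetime_bytes / days


def sustained_rate_bytes_per_sec(lifetime_bytes=LIFETIME_BYTES, days=DAYS_IN_LIFE):
    """Average capture rate, around the clock, that fills the store in a lifetime."""
    return daily_budget_bytes(lifetime_bytes, days) / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"{daily_budget_bytes() / 1e9:.0f} GB per day")
    print(f"{sustained_rate_bytes_per_sec() / 1e3:.0f} KB per second, sustained")
```

Under these assumptions the budget is about 40 GB per day, which is why continuous video is the only category on the chart that approaches the petabyte mark.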

Page 10: What to Do?  A Research Agenda

Challenges

• Capture: Get the bits in

• Organize: Index them

• Manage: No worries about loss or space

• Curate/ Annotate: automate where possible

• Privacy: Keep safe from theft / disclosure.

• Summarize: Give thumbnail summaries

• Interface: how to ask/anticipate questions

• Present: show it in understandable ways.

Page 11: What to Do?  A Research Agenda

MyLifeBits Software

[Architecture diagram. Components shown:]

• MyLifeBits store (database, files)
• MyLifeBits Shell
• Voice annotation tool
• Text annotation tool
• Telephone capture tool
• TV capture tool
• TV EPG download tool
• Radio capture tool
• Radio EPG tool
• PocketPC transfer tool
• PocketRadio player
• Import files
• Legacy applications
• Browser tool
• Internet
• IM capture
• MAPI interface
• Legacy email client

Page 12: What to Do?  A Research Agenda

MyLifeBits Interesting Ideas
http://www.research.microsoft.com/barc/mediapresence/MyLifeBits.aspx

• Capture is “easy”
– All TV, all phone, all web pages, all mail, ...
– SenseCam: a picture a minute
– GPS correlation (and other correlation)

• Interesting search strategies
– Pivot and cluster

• Struggling with metadata (annotations)

• Interesting visualizations (ambiance, timelines, spatial, conceptual).

• Very topical: interest from public, industry, developers

Page 13: What to Do?  A Research Agenda

Relevance To CACR

• (re)Invent the laboratory notebook

• Scientists are still mostly working with paper and pencil (especially in the lab).

• Change the way scientists do information management.

• Give them good analysis/visualization tools

• Allow them to publish their data/workbooks

Page 14: What to Do?  A Research Agenda

80% of data is personal / individual. But, what about the other 20%?

• Business
– Walmart online: 1 PB and growing...
– Paradox: most “transaction” systems < 1 PB.
– Have to go to image/data monitoring for big data

• Government
– Government is the biggest business.

• Science
– LOTS of data.

Page 15: What to Do?  A Research Agenda

Data Challenges I'm Struggling With

1. Sneakernet is probably the best way to move WAN data at 1 GBps. File transfer efforts are currently at 550 MBps via Internet2. How do we manage the multi-petabyte file repository we are about to generate?

2. The TerraServer has evolved from a mainframe to a bunch of bricks. The new design has been operating for a year and we are quite pleased with it. But we face "how do you manage a bunch?" and "what is the best geoplex strategy?"

3. The SkyServer website is built using database technology and web services. Now moving the web services inside the database. Others are working to design a scale-out version of the server. There are several interesting data challenges in these changes.

4. Using relational tuples to represent spatial volumes as constraints. Point-in-polygon and polygon-overlap queries can then be quickly evaluated. I will briefly describe this idea.

Page 16: What to Do?  A Research Agenda

How Do You Move A Terabyte?

Context      Speed (Mbps)   Rent ($/month)   $/Mbps   $/TB sent   Time/TB
Home phone   0.04           40               1,000    3,086       6 years
Home DSL     0.6            50               117      360         5 months
T1           1.5            1,200            800      2,469       2 months
T3           43             28,000           651      2,010       2 days
OC3          155            49,000           316      976         14 hours
100 Mbps     100                                                  1 day
Gbps         1000                                                 2.2 hours
OC 192       9600           1,920,000        200      617         14 minutes

Context

Source: TeraScale Sneakernet, Microsoft Technical Report May 2002, MSR-TR-2002-54 http://research.microsoft.com/research/pubs/view.aspx?tr_id=569
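The Time/TB and $/TB columns follow from the link speed and the monthly rent. The sketch below reproduces them under two assumptions that are not stated on the slide: a terabyte is 10^12 bytes, and a month is 30 days. With those choices the T1, OC3 and OC 192 rows come out as listed.

```python
# Time and cost to push one terabyte through a rented link.
# Assumptions: 1 TB = 10**12 bytes, 1 month = 30 days, link fully utilized.
TB_BITS = 8 * 10**12
SECONDS_PER_MONTH = 30 * 86_400


def seconds_per_tb(speed_mbps):
    """Seconds needed to send one terabyte at the given speed in Mbps."""
    return TB_BITS / (speed_mbps * 1e6)


def dollars_per_tb(speed_mbps, rent_per_month):
    """Share of the monthly rent consumed while sending one terabyte."""
    return rent_per_month * seconds_per_tb(speed_mbps) / SECONDS_PER_MONTH


if __name__ == "__main__":
    for name, mbps, rent in [("T1", 1.5, 1_200), ("OC3", 155, 49_000),
                             ("OC 192", 9600, 1_920_000)]:
        hours = seconds_per_tb(mbps) / 3600
        print(f"{name}: {hours:.1f} hours/TB, ${dollars_per_tb(mbps, rent):,.0f}/TB")
```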

Page 17: What to Do?  A Research Agenda

Moving Data Bricks

• WAN costs >> 100 $/Mbps/month, >> 1 $/GB

• Beowulf networking is 10,000x cheaper than WAN: factors of 10^5 matter.

• The cheapest and fastest way to move a terabyte cross country is sneakernet.
24 hours = 4 MB/s
50$ shipping vs 1,000$ WAN cost.

Page 18: What to Do?  A Research Agenda

Giga Byte Per Second File Mover

• CERN to Pasadena
– Windows TCP/IP stack improvements
– Opteron demo
– Disk-to-disk at 550 MBps now (~2 TB/hour)
– Near the PCI-X limit.

• GOAL: 1GBps disk-to-disk – 75% there

OC192 = 9.9 Gbps

[Chart: CERN-Caltech Transfer Speeds, Newisys->Newisys. Y axis: MBps (0 to 1000). X axis: Mar-04 through Sep-04, with a second version extending to Apr-05. Series: File Transfer MBps and 1 Stream tcp MBps. Reference lines: PCI-X limit and tcp limit.]

Page 19: What to Do?  A Research Agenda

But then what? Managing Petabytes

• CERN files are 30MB

• They produce 1 B files/year.

• How to name them?

• How to manage them?

• Depends on workload: how they are used.

• It’s a DB problem.
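To get a feel for the scale behind these bullets, the two figures on the slide can be multiplied out. This is only a sketch: it assumes decimal units (1 MB = 10^6 bytes), reads "1 B files/year" as one billion files, and assumes files arrive evenly over a 365-day year.

```python
# Yearly volume and arrival rate implied by "30 MB files, 1 B files/year".
# Assumptions: decimal megabytes, one billion files, uniform arrival over 365 days.
FILE_BYTES = 30 * 10**6
FILES_PER_YEAR = 10**9
SECONDS_PER_YEAR = 365 * 86_400


def bytes_per_year(file_bytes=FILE_BYTES, files=FILES_PER_YEAR):
    """Total bytes produced in one year."""
    return file_bytes * files


def files_per_second(files=FILES_PER_YEAR):
    """Average number of new files that must be named and cataloged each second."""
    return files / SECONDS_PER_YEAR


if __name__ == "__main__":
    print(f"{bytes_per_year() / 1e15:.0f} PB per year")
    print(f"{files_per_second():.1f} new files per second")
```

Under these assumptions that is about 30 PB and roughly 32 new catalog entries every second, which is one way to see why the slide frames naming and management as a database problem.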

Page 20: What to Do?  A Research Agenda

Data Challenges I'm Struggling With

Jim Gray, Microsoft Research

1. Sneakernet is probably the best way to move WAN data at 1 GBps. File transfer efforts are currently at 550 MBps via Internet2. How do we manage the multi-petabyte file repository we are about to generate?

2. The TerraServer has evolved from a mainframe to a bunch of bricks. The new design has been operating for a year and we are quite pleased with it. But we face "how do you manage a bunch?" and "what is the best geoplex strategy?"

3. The SkyServer website is built using database technology and web services. Now moving the web services inside the database. Others are working to design a scale-out version of the server. There are several interesting data challenges in these changes.

4. Using relational tuples to represent spatial volumes as constraints. Point-in-polygon and polygon-overlap queries can then be quickly evaluated. I will briefly describe this idea.

Page 21: What to Do?  A Research Agenda

TerraServer / TerraService
http://terraService.Net/ http://TerraServer-USA.com/

• USGS Photo of US

• Online since June 1998

• Operated by Microsoft

• 20 TB data source

• 10 M web hits/day

• A web service

• Our laboratory

Page 22: What to Do?  A Research Agenda

TerraServer – What’s new

• Web Service and Web Server

• New ~1 ft²/pixel full color image of 120 urban areas

• Storage Bricks
– “Commodity servers”
– 4 TB raw / 2 TB RAID1 SATA storage
– Dual 2 GHz + 4 GB RAM
– 3 Bricks = TerraServer data
– Data partitioned
– Moving to Yukon
– Working on low TCO auto-manage

• Low Cost Availability: Pair & Spare
– RAID1 Mirroring
– Mirrored Bunches (Yukon log ship?)
– Spare Brick
– Web Application
  • Load balances mirrors
  • Uses surviving database on failure

[Diagram label: KVM / IP]

Page 23: What to Do?  A Research Agenda

TerraServer Challenges

• Best Geoplex strategy?

• Moving Web Services into the DB?

• Managing bunches (lower TCO).

Page 24: What to Do?  A Research Agenda

World Wide Telescope

• Premise: Most Astronomy data is online

• So, the Internet is the world’s best telescope:
– It has data on every part of the sky
– In every measured spectral band
– As deep as the best instruments (2 years ago).
– It is up when you are up. The “seeing” is always great (no working at night, no clouds, no moons, no...).
– It’s a smart telescope: links objects and data to literature on them.

Page 25: What to Do?  A Research Agenda

SkyServer.SDSS.org
Built with Johns Hopkins U.

• A modern archive
– Raw pixel data lives in file servers
– Catalog data (derived objects) in database
– Online query to any and all

• Also used for education
– 150 hours of online Astronomy
– Implicitly teaches data analysis

• Interesting things
– Spatial data search
– Query interface via Emacs, Perl, Java…
– Popular -- 1% of TerraServer
– Cloned by other surveys (a template design)
– Based on Web Services

Page 26: What to Do?  A Research Agenda

Quick Overview (Services)

• SkyServer (skyserver.sdss.org)
– Web site delivers Sloan Digital Sky Survey data
– Also has education
– 1,000x less popular than TerraServer, but HUGE for a science website.

• A Batch Job System with Personal DBs
– Lets users run jobs http://casjobs.sdss.org/CasJobs/
– Parameters & answers to & from Personal DB
– Simple batch job scheduler.

• Web Services: http://www.voservices.org/
– Photographic objects
– Spectrographic objects
– Transformation functions
– 7 out of the 8 are .NET.

Page 27: What to Do?  A Research Agenda

Federation: SkyQuery.Net

• Combines 15 archives

• Send query to portal, portal joins data from archives.

• Evolving Portal to have
– Personal databases (workbenches)
– Batch scheduling of monster queries.

[Diagram: the SkyQuery Portal connected to the 2MASS, INT, SDSS and FIRST archives and to an Image Cutout service.]

Page 28: What to Do?  A Research Agenda

The Data Challenges

• Parallel data search (data pump).
– How to partition?
– How to manage load?

• Moving web services to DB
– What is the right approach?

• Move objects into DB
– Spatial access methods
– Data analysis in the DB.

• Managing Petabytes

Page 29: What to Do?  A Research Agenda

The Knowledge Challenge

• We need to “objectify science”

• What is a Gene? Star? River? Cell? …
– What is the definition (formal)?
– What attributes do they have?
– What are their dynamics?

• Defining this (the “O” word) has to happen in each discipline.

• This is the Knowledge level (rather than the data level)

Page 30: What to Do?  A Research Agenda

Data Challenges I'm Struggling With

1. Sneakernet is probably the best way to move WAN data at 1 GBps. File transfer efforts are currently at 550 MBps via Internet2. How do we manage the multi-petabyte file repository we are about to generate?

2. The TerraServer has evolved from a mainframe to a bunch of bricks. The new design has been operating for a year and we are quite pleased with it. But we face "how do you manage a bunch?" and "what is the best geoplex strategy?"

3. The SkyServer website is built using database technology and web services. Now moving the web services inside the database. Others are working to design a scale-out version of the server. There are several interesting data challenges in these changes.

4. Using relational tuples to represent spatial volumes as constraints. Point-in-polygon and polygon-overlap queries can then be quickly evaluated. I will briefly describe this idea.

Page 31: What to Do?  A Research Agenda

A Detail: 3 Ways We Do Spatial

• Hierarchical mesh (extension to SQL)
– Uses table valued stored procedures
– Acts as a new “spatial access method”
– Porting to Yukon CLR for a 10x speedup.

• Zones: fits SQL like a glove
– Amazingly simple, amazingly good.

• Constraints: a really novel idea
– Lets us do algebra on regions.

• Paper: There Goes the Neighborhood: Relational Algebra for Spatial Data Search

• Idea in backup slides.

Page 32: What to Do?  A Research Agenda

Equations Define Subspaces

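• For (x,y) above the line:
ax + by > c

• Reverse the space by:
-ax + -by > -c

• Intersect 3 volumes:
a1x + b1y > c1
a2x + b2y > c2
a3x + b3y > c3

[Figure: the line ax + by = c in the x-y plane, with intercepts x = c/a and y = c/b; a second x-y panel accompanies the intersection.]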
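A minimal sketch of the half-plane idea in two dimensions: each constraint is a triple (a, b, c), a point is inside when ax + by > c, reversing negates all three numbers, and a convex region is the intersection of several constraints. The `HalfPlane` class and the triangle are illustrative names and values, not taken from the slides.

```python
from typing import Iterable, NamedTuple


class HalfPlane(NamedTuple):
    """The set of points (x, y) with a*x + b*y > c."""
    a: float
    b: float
    c: float

    def contains(self, x: float, y: float) -> bool:
        return self.a * x + self.b * y > self.c

    def reversed(self) -> "HalfPlane":
        """The other side of the same line: -a*x + -b*y > -c."""
        return HalfPlane(-self.a, -self.b, -self.c)


def in_intersection(planes: Iterable[HalfPlane], x: float, y: float) -> bool:
    """A point is in the intersection if every half-plane contains it."""
    return all(p.contains(x, y) for p in planes)


# Example region (hypothetical): the open triangle x > 0, y > 0, x + y < 1.
TRIANGLE = [HalfPlane(1, 0, 0), HalfPlane(0, 1, 0), HalfPlane(-1, -1, -1)]

if __name__ == "__main__":
    print(in_intersection(TRIANGLE, 0.25, 0.25))
    print(in_intersection(TRIANGLE, 1.0, 1.0))
```

The third constraint shows the reversal trick in use: x + y < 1 is written as -x - y > -1 so that every row has the same "greater than" form.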

Page 33: What to Do?  A Research Agenda

HTM Approach

• Table-valued function: find points near a point
– Select * from fGetNearbyEq(ra,dec,r)

• Use Hierarchical Triangular Mesh www.sdss.jhu.edu/htm/
– Space filling curve, bounding triangles…
– Standard approach

• 13 ms/call… So 70 objects/second.

• Too slow, so precompute neighbors: materialized view.

• At 70 objects/sec it takes 6 months to compute a billion objects.
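The timing claim can be checked with two divisions. The sketch below assumes one call per object and a single serial worker, which is the reading the slide suggests; the function names are illustrative.

```python
# Throughput implied by a per-call latency, and the time to process N objects.
# Assumptions: one call per object, calls issued serially with no overlap.
SECONDS_PER_DAY = 86_400


def objects_per_second(ms_per_call):
    """Serial throughput for a call that takes ms_per_call milliseconds."""
    return 1000.0 / ms_per_call


def days_to_process(n_objects, rate_per_second):
    """Days needed to process n_objects at a steady rate."""
    return n_objects / rate_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"{objects_per_second(13):.0f} objects/second at 13 ms/call")
    print(f"{days_to_process(1e9, 70):.0f} days for a billion objects at 70/second")
```

At 13 ms per call the ceiling is about 77 objects per second, so the slide's 70 is a rounded-down figure, and a billion objects at that rate takes roughly 165 days, in line with the "6 months" on the slide.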

Page 34: What to Do?  A Research Agenda

Areas defined by String (or struct in C# world)

• circleSpec := CIRCLE J2000 ra dec radArcMin
             | CIRCLE CARTESIAN x y z radArcMin
• rectSpec := RECT J2000 {ra dec}2
• polySpec := POLY J2000 {ra dec}3+
           | POLY CARTESIAN { x y z }3+
• hullSpec := CHULL J2000 {ra dec}3+
           | CHULL CARTESIAN { x y z }3+
• convexSpec := CONVEX { x y z d}+
• regionSpec := REGION { convexSpec }+
• areaSpec := circleSpec | rectSpec | polySpec | hullSpec | regionSpec

Page 35: What to Do?  A Research Agenda

Working in 3D Avoids Spherical Geometry

• P = (px,py,pz) is inside the circle centered at C = (x,y,z) with radius r radians if P•C > cos(r)

• Arbitrary polygons are intersections of these regions.

[Figure label: cos(r)]
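A short sketch of the dot-product test: convert (ra, dec) to a unit vector, then compare P•C with cos(r). The function names and the degree-based inputs are assumptions for illustration; the slides state only the inequality itself.

```python
import math


def radec_to_xyz(ra_deg, dec_deg):
    """Unit vector on the sphere for right ascension and declination in degrees."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(dec) * math.cos(ra),
            math.cos(dec) * math.sin(ra),
            math.sin(dec))


def inside_circle(p, center, r_radians):
    """True if unit vector p lies within angular radius r of unit vector center."""
    dot = sum(pi * ci for pi, ci in zip(p, center))
    return dot > math.cos(r_radians)


if __name__ == "__main__":
    center = radec_to_xyz(180.0, 0.0)
    one_degree = math.radians(1.0)
    print(inside_circle(radec_to_xyz(180.5, 0.0), center, one_degree))
    print(inside_circle(radec_to_xyz(182.0, 0.0), center, one_degree))
```

The test needs three multiplications and one comparison per point, with cos(r) computed once per query, so no trigonometry is evaluated per object.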

Page 36: What to Do?  A Research Agenda

Find Points Inside Area

• fGetNearbyObjEq(ra,dec,r)
fGetNearbyObjXyz(x,y,z,r)
fGetNearestObjEq(ra,dec,r)
fGetNearestObjXyz(x,y,z,r)

• fGetObjInside(region)

• Recently Alex added Healpix, Igloo is also a nice iso-area decomposition

Page 37: What to Do?  A Research Agenda

HTM reprise

• Good for point-in-area.

• Not good for area-overlaps-area (but can simplify areas and test for empty)

Page 38: What to Do?  A Research Agenda

To Repeat (for area algebra) Equations Define Subspaces

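• For (x,y) above the line:
ax + by > c

• Reverse the space by:
-ax + -by > -c

• Intersect 3 volumes:
a1x + b1y > c1
a2x + b2y > c2
a3x + b3y > c3

[Figure: the line ax + by = c in the x-y plane, with intercepts x = c/a and y = c/b; a second x-y panel accompanies the intersection.]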

Page 39: What to Do?  A Research Agenda

Domain is Union of Convex Hulls

• Simple volumes are unions of convex hulls.

• Higher order curves also work

• Complex volumes have holes and their holes have holes. (that is harder).

[Figure: example shapes, one labeled "Not a convex hull".]

Page 40: What to Do?  A Research Agenda

Now in Relational Terms

create table HalfSpace (
    domainID    int not null                       -- domain name
                foreign key references Domain(domainID),
    convexID    int not null,                      -- grouping a set of ½ spaces
    halfSpaceID int identity,                      -- a particular ½ space
    x           float not null,                    -- the (a,b,..) parameters
    y           float not null,                    -- defining the ½ space
    z           float not null,
    c           float not null,                    -- the constant (“c” above)
    primary key (domainID, convexID, halfSpaceID)
)

(x,y,z) is inside a convex if it is inside all lines of the convex.
(x,y,z) is inside a convex if it is NOT OUTSIDE ANY line of the convex.

select convexID                        -- return the convex hulls
from HalfSpace                         -- from the constraints
where @x * x + @y * y + @z * z < c     -- point outside the line?
group by all convexID                  -- consider all the lines of a convexID
having count(*) = 0                    -- count outside == 0
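The "not outside any line" query can be tried out with Python's built-in sqlite3 module. This is a sketch under stated assumptions: SQLite has no `group by all`, so the count of violated constraints is computed with a conditional sum instead, the schema is cut down to the columns the query needs, and the two sample convexes are invented for the demonstration.

```python
import sqlite3


def demo_connection():
    """In-memory HalfSpace table with two hypothetical convexes."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table HalfSpace ("
        " domainID int not null, convexID int not null,"
        " halfSpaceID integer primary key,"
        " x float not null, y float not null, z float not null,"
        " c float not null)"
    )
    rows = [
        # convex 1: the octant x >= 0, y >= 0, z >= 0
        (1, 1, 1.0, 0.0, 0.0, 0.0),
        (1, 1, 0.0, 1.0, 0.0, 0.0),
        (1, 1, 0.0, 0.0, 1.0, 0.0),
        # convex 2: the half-space x <= 0, written as -x >= 0
        (1, 2, -1.0, 0.0, 0.0, 0.0),
    ]
    conn.executemany(
        "insert into HalfSpace (domainID, convexID, x, y, z, c)"
        " values (?, ?, ?, ?, ?, ?)", rows)
    return conn


def convexes_containing(conn, px, py, pz):
    """IDs of convexes for which the point violates no half-space constraint."""
    cursor = conn.execute(
        "select convexID from HalfSpace"
        " group by convexID"
        " having sum(case when ? * x + ? * y + ? * z < c then 1 else 0 end) = 0"
        " order by convexID",
        (px, py, pz),
    )
    return [row[0] for row in cursor.fetchall()]


if __name__ == "__main__":
    conn = demo_connection()
    print(convexes_containing(conn, 1, 1, 1))
    print(convexes_containing(conn, -1, 5, 5))
```

The conditional sum plays the role of `group by all`: a plain `where` filter would drop every convex with zero violations before grouping, which are exactly the convexes the query is looking for.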

Page 41: What to Do?  A Research Agenda

The Algebra is Simple (Boolean)

@domainID = spDomainNew (@type varchar(16), @comment varchar(8000))
@convexID = spDomainNewConvex (@domainID int)
@halfSpaceID = spDomainNewConvexConstraint (@domainID int, @convexID int, @x float, @y float, @z float, @l float)
@returnCode = spDomainDrop(@domainID)

select * from fDomainsContainPoint(@x float, @y float, @z float)

Once constructed they can be manipulated with the Boolean operations.

@domainID = spDomainOr (@domainID1 int, @domainID2 int, @type varchar(16), @comment varchar(8000))
@domainID = spDomainAnd (@domainID1 int, @domainID2 int, @type varchar(16), @comment varchar(8000))
@domainID = spDomainNot (@domainID1 int, @type varchar(16), @comment varchar(8000))
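One way to see why the algebra is Boolean: treat a domain as a union (OR) of convexes and a convex as an intersection (AND) of half-spaces. The sketch below is an in-memory illustration of that representation, not the implementation behind the stored procedures above; the function names and the tuple layout (x, y, z, c) are assumptions. Points exactly on a boundary are counted as inside on both sides, a simplification made for brevity.

```python
from itertools import product

# A half-space is a tuple (x, y, z, c): point p is inside when x*px + y*py + z*pz >= c.
# A convex is a list of half-spaces (AND); a domain is a list of convexes (OR).


def half_space_contains(h, p):
    return h[0] * p[0] + h[1] * p[1] + h[2] * p[2] >= h[3]


def domain_contains(domain, p):
    return any(all(half_space_contains(h, p) for h in convex) for convex in domain)


def domain_or(d1, d2):
    """Union: keep the convexes of both domains."""
    return d1 + d2


def domain_and(d1, d2):
    """Intersection: pair every convex of d1 with every convex of d2."""
    return [c1 + c2 for c1, c2 in product(d1, d2)]


def domain_not(d):
    """Complement via De Morgan: AND over convexes of (OR of reversed half-spaces)."""
    result = [[]]  # one convex with no constraints: the whole space
    for convex in d:
        complement = [[(-x, -y, -z, -c)] for (x, y, z, c) in convex]
        result = domain_and(result, complement)
    return result


# Hypothetical sample domains.
OCTANT = [[(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0)]]   # x, y, z all >= 0
LEFT = [[(-1, 0, 0, 0)]]                                 # x <= 0

if __name__ == "__main__":
    print(domain_contains(domain_or(OCTANT, LEFT), (-1, -1, -1)))
    print(domain_contains(domain_and(OCTANT, LEFT), (1, 1, 1)))
    print(domain_contains(domain_not(OCTANT), (-1, 1, 1)))
```

OR is list concatenation and AND is a cross product of convexes, so both stay inside the same rows-of-constraints representation; NOT can multiply the number of convexes, which is one reason simplification matters in practice.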

Page 42: What to Do?  A Research Agenda

What! No Bounding Box?

• Bounding box limits search. A subset of the convex hulls.

• If query runs at 3M halfspace/sec then no need for bounding box, unless you have more than 10,000 lines.

• But, if you have a lot of half-spaces then bounding box is good.

Page 43: What to Do?  A Research Agenda

Zone Approach

• Divide space into zones

• Key points by zone, offset (on the sphere this needs a wrap-around margin.)

• Point search: look in a few zones at a limited offset, ra ± r, a bounding box that has 1-π/4 false positives

• All inside the relational engine

• Avoids “impedance mismatch”

• Can “batch” all-all comparisons

• 33x faster and parallel: 6 days, not 6 months!

[Figure: zone geometry with labels r, ra-zoneMax, zoneMax, x, Ra ± x, and the expression √(r^2+(ra-zoneMax)^2) cos(radians(zoneMax)).]

Page 44: What to Do?  A Research Agenda

In SQL

select o1.objID                                   -- find objects
from zone o1                                      -- in the zoned table
where o1.zoneID between                           -- where zone #
          floor((@dec-@r)/@zoneHeight) and        -- overlaps the circle
          floor((@dec+@r)/@zoneHeight)
  and o1.ra between @ra - @r and @ra + @r         -- quick filter on ra
  and o1.dec between @dec-@r and @dec+@r          -- quick filter on dec
  and sqrt( power(o1.cx-@cx,2)
          + power(o1.cy-@cy,2)
          + power(o1.cz-@cz,2) ) < @r             -- careful filter on distance

Eliminates the ~21% = 1-π/4 false positives

[Figure: circle inscribed in its bounding box.]
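The same three-stage filter (zone range, bounding box, exact distance) can be sketched in plain Python. Several choices here are assumptions rather than details from the slides: a 0.5 degree zone height, a dictionary standing in for the zone-clustered table, an ra window widened by asin(sin r / cos dec) away from the equator, modular arithmetic for the wrap-around at ra = 0/360, and the dot-product form of the exact distance test.

```python
import math
from collections import defaultdict

ZONE_HEIGHT = 0.5  # degrees; an assumed value


def unit_vector(ra_deg, dec_deg):
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(dec) * math.cos(ra), math.cos(dec) * math.sin(ra), math.sin(dec))


def build_zones(points, zone_height=ZONE_HEIGHT):
    """Group (obj_id, ra, dec) tuples by zone number floor(dec / zone_height)."""
    zones = defaultdict(list)
    for obj_id, ra, dec in points:
        zones[math.floor(dec / zone_height)].append((obj_id, ra, dec))
    return zones


def nearby(zones, ra, dec, r, zone_height=ZONE_HEIGHT):
    """IDs of objects within r degrees of (ra, dec); assumes 0 < r < 90."""
    if abs(dec) + r < 90.0:
        ratio = math.sin(math.radians(r)) / math.cos(math.radians(dec))
        half_width = math.degrees(math.asin(ratio))
    else:
        half_width = 180.0  # the circle covers a pole: every ra qualifies
    center = unit_vector(ra, dec)
    cos_r = math.cos(math.radians(r))
    first = math.floor((dec - r) / zone_height)
    last = math.floor((dec + r) / zone_height)
    hits = []
    for zone_id in range(first, last + 1):           # only zones overlapping the circle
        for obj_id, o_ra, o_dec in zones.get(zone_id, ()):
            d_ra = abs((o_ra - ra + 180.0) % 360.0 - 180.0)
            if d_ra > half_width or abs(o_dec - dec) > r:
                continue                             # quick bounding-box filter
            o = unit_vector(o_ra, o_dec)
            if sum(a * b for a, b in zip(o, center)) >= cos_r:
                hits.append(obj_id)                  # careful distance filter
    return sorted(hits)


POINTS = [(1, 10.0, 0.0), (2, 10.3, 0.2), (3, 10.9, 0.9),
          (4, 359.9, 0.0), (5, 10.0, 5.0), (6, 10.45, 0.45)]

if __name__ == "__main__":
    zones = build_zones(POINTS)
    print(nearby(zones, 10.0, 0.0, 0.5))
    print(nearby(zones, 0.1, 0.0, 0.5))
```

Object 6 in the sample data sits in a corner of the bounding box: it passes both quick filters and is rejected only by the distance test, which is the 1-π/4 (about 21%) false-positive case named on the slide.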

Page 45: What to Do?  A Research Agenda

Summary

• SQL is a set oriented language

• You can express constraints as rows

• Then you
– Can evaluate LOTS of predicates per second
– Can do set algebra on the predicates.

• Benefits from SQL parallelism

• SQL == Prolog?

Page 46: What to Do?  A Research Agenda

Talking the Talk and Walking the Walk

• We have 7 people – what’s our agenda?

• Observe it is all about:
Data → Information → Knowledge → Wisdom
People == Communication is the “killer app”

• ½ Personal Information Management

• ½ Corporate Information Management