Finding Needles in the Internet Haystack Ron K. Cytron Washington University in Saint Louis Department of Computer Science http://www.cs.wustl.edu/~cytron/ Century Club May 2002 Roger Chamberlain, Mark Franklin, Ron Indeck, John Lockwood, George Varghese (UCSD) Mahesh Jayaram Thanks: Ben Brodie Center for Distributed Object Computing Department of Computer Science Washington University

Page 1: Finding Needles in the Internet Haystack Ron K. Cytron Washington University in Saint Louis Department of Computer Science cytron

Finding Needles in the Internet Haystack

Ron K. Cytron

Washington University in Saint Louis

Department of Computer Science

http://www.cs.wustl.edu/~cytron/

Century Club May 2002

Roger Chamberlain, Mark Franklin, Ron Indeck, John Lockwood, George Varghese (UCSD)

Mahesh Jayaram

Thanks: Ben Brodie

Center for Distributed Object Computing

Department of Computer Science

Washington University

Page 7:

Outline

• Computers have come a long way

• Today’s computers are never lonely

• Volumes and volumes of data

• Fast searching of magnetic media

• Internet packet filtering

• Conclusion

Page 8:

A Grandchild’s Gift

         1966         1999
Cost:    $60          $35
Memory:  ½ char       16 M chars
Speed:   1 cycle/s    16 M cycles/s
Fails:   10 seconds   5 years

Page 9:

If cars improved that much in 30 years …

• $4000

• 60,000 miles per hour

• Seats 10,000 people

• Gets 20,000 miles per gallon

• Breaks every 70 years

Page 10:

The Haystack

• The Internet is large and growing

• Content on the Internet is growing even faster

• A haystack sits still, but the Internet….

Page 11:

Growth of the Internet (why computers aren’t lonely anymore)

[Chart: interconnected computers (0 to 3,500,000) by year, 1969 to 1994]

Y2K Problem (?):

More computers sold than TVs

Page 12:

Growth of Internet Content (volumes and volumes of data)

[Chart: articles per day (0 to 30,000) by year, 1979 to 1993]

Anybody can publish

Problem is how to find what you want

Page 13:

Page 6B

What can tech companies do? Some say they're at a loss, but others offer budding solutions

By Kevin Maney

On July 7, 1940, as the nation edged toward World War II, IBM put out a statement that made headlines. The company offered all its facilities for national defense, ready to convert to making anything the government needed.

Other leaders in the electro-mechanical technology of the day -- Ford Motor, General Motors, General Electric -- also threw their weight into defense efforts. They switched from making cars and washing machines to building tanks, aircraft engines and machine guns.

So here we are in 2001, readying for another war. The U.S. technology industry is the best and most innovative in the world. It is the nation's pride and joy.

Shouldn't it do something?

9/17/2001

Page 14:

. . .

One possibility is in data-mining technology. Data mining is a way to collect millions of pieces of information in a computer system, sift through that data, make sense of them and come up with something useful. ''We (the U.S. tech industry) are experts at data mining and have vast resources of data to mine,'' says Tom Evslin, CEO of Internet communications company ITXC. ''We have used it to target advertising. We can probably use it to identify suspicious activity or potential terrorists.''

. . .

Page 15:

Fast searching of magnetic media

with Roger Chamberlain, Mark Franklin, Ron Indeck, John Lockwood

Page 16:

Enabling Technology: Disk Drives

Magnetic disk storage areal density vs. year of IBM product introduction

(From D. A. Thompson)

Almost 10,000,000x increase in 45 years!
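
The headline figure can be turned into a yearly rate. This is a rough sketch that assumes, purely for illustration, that the 10,000,000x increase was spread evenly over the 45 years; the chart itself is not a perfectly straight line.

```python
import math

# Figures taken from the slide; even growth is an assumption for illustration.
total_increase = 10_000_000
years = 45

# Compound growth: annual_factor ** years == total_increase
annual_factor = total_increase ** (1 / years)

# Years needed for density to double at that steady rate
doubling_years = math.log(2) / math.log(annual_factor)

print(f"about {annual_factor:.2f}x per year, doubling every {doubling_years:.1f} years")
```

Under that assumption, density grows by roughly 43% per year and doubles about every two years.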

Page 17:

Cost per Megabyte

Price history of hard disk product vs. year of product introduction

(From D. A. Thompson)

Cost decreasing 3% per week!
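
A weekly rate compounds quickly. The sketch below assumes the 3% weekly decrease holds steadily for a full year (an assumption made only to show the arithmetic).

```python
import math

# Rate taken from the slide; a steady rate over 52 weeks is an assumption.
weekly_factor = 1 - 0.03

# Price after one year, as a fraction of today's price
yearly_factor = weekly_factor ** 52

# Weeks until the price per megabyte halves
halving_weeks = math.log(0.5) / math.log(weekly_factor)

print(f"after a year: {yearly_factor:.1%} of the starting price")
print(f"price halves about every {halving_weeks:.0f} weeks")
```

At that pace the cost falls to about a fifth of its starting value within a year.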

Page 18:

Massive Storage & Data

• Storage industry will ship 4,000,000,000,000,000,000 Bytes this year

• FedEx generated 14 Terabytes of data last year

• US intelligence collects data equaling the printed collection of the US library every day!

Page 19:

Massive Data Sets

• Employee records

• Consumer information

• Maps/mission/intelligence data

• Genome maps

Data sets are now measured in Terabytes, and are dynamic!

Page 20:

Genome Application

• Genome maps are growing, expanded daily

– Wash U sequencing center

– Each of us has 80,000 genes found among 3 billion characters of DNA (A,C,G,T)

• Look for matches

– Identify function

– Disease: understand, diagnose, detect, medicine, therapy

– Biofuels, warfare, toxic waste

– Understand evolution

– Forensics, organ donors, authentication

– More effective crops, disease resistance

Page 21:

DNA String Matching

• Looking for CACGTTAGT…TAGC

• Interested in matches and near matches

• Search human genome and other gene oceans

– Need to search entire data sets

Page 22:

Bio Computation Problem

[Diagram: *BIG* genome databases; a DNA pattern is compared with a DNA sequence (rows shown: A C G T G and T A C A G); Match?]

Approximate matches are just as useful
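
One simple way to picture approximate DNA matching is to slide the pattern along the sequence and count mismatched characters at each offset. The sketch below is an illustration only, not the method used in the talk: it handles substitutions but not insertions or deletions, and the sample sequence and function names are made up for the example.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_near_matches(pattern: str, sequence: str, max_mismatches: int):
    """Return (offset, mismatches) for every window within the mismatch budget."""
    m = len(pattern)
    hits = []
    for i in range(len(sequence) - m + 1):
        d = hamming(pattern, sequence[i:i + m])
        if d <= max_mismatches:
            hits.append((i, d))
    return hits


# Hypothetical sample data: an exact copy of the pattern at offset 5
# and a one-character variant (TTCAG) at offset 10.
sequence = "ACGTGTACAGTTCAG"
print(find_near_matches("TACAG", sequence, max_mismatches=1))
```

Every window of the sequence is examined, which is why the whole data set has to be searched.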

Page 23:

Finding a needel in a heystuck

• DNA and live text can contain errors

• We often seek an approximate match, for example: needle

• No match? Try 2-transpositions: enedle, needle, nedele, neelde, needel

• No match? Try 1-deletions: eedle, nedle, nedle, neele, neede, needl

• No match? Try insertions, larger edits, …

• An exponential number of possibilities
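
The two candidate lists on this slide can be generated mechanically. This is a minimal sketch with made-up function names; it reproduces the slide's lists for "needle", including the repeats that arise from the double "e".

```python
def adjacent_transpositions(word: str) -> list[str]:
    """Swap each pair of neighboring characters, left to right."""
    return [word[:i] + word[i + 1] + word[i] + word[i + 2:]
            for i in range(len(word) - 1)]


def single_deletions(word: str) -> list[str]:
    """Drop one character at each position, left to right."""
    return [word[:i] + word[i + 1:] for i in range(len(word))]


print(adjacent_transpositions("needle"))
print(single_deletions("needle"))
```

Applying these edits repeatedly, and adding insertions and substitutions, is what makes the number of candidates blow up.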

Page 24:

How is this done today?

• Think of every way a word can be misspelled

• Present each misspelling to the computer for an exact match

[Diagram: enedle, needle, nedele, neelde, needel each checked for an exact match; answers: Yes / No]
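
The approach on this slide amounts to one full exact-match scan per candidate spelling. A minimal sketch, with a made-up sample text and function name:

```python
def exact_match_passes(text: str, candidates: list[str]) -> dict[str, bool]:
    """Run one exact-match scan of the text for each candidate spelling."""
    return {candidate: (candidate in text) for candidate in candidates}


# Hypothetical sample text containing one misspelling of "needle"
text = "finding a needel in a haystack"
candidates = ["enedle", "needle", "nedele", "neelde", "needel"]

results = exact_match_passes(text, candidates)
for spelling, found in results.items():
    print(spelling, "Yes" if found else "No")
```

The work grows with the number of candidate spellings, since each one costs another pass over the data.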

Page 25:

How can we do better?

• Data is present on magnetic media

• Hardware at the disk is

– Already fault tolerant (more on this later)

– Distributed across all surfaces

We win if number of misspellings is large, and the number of false hits is small

Page 26:

Another Application: Intelligence Data

• Lots of data

• Changing constantly

• Many perturbations

– Tzar, tsar, czar, . . .

• Don’t know what we want to look for beforehand

Page 27:

Google Search Engine

• Crawls the web once per month

• Caches web pages

• Fast, exact text-based search (see how soon)


Page 28:

Image Database Applications

• Challenging database

• Unstructured

• Massive data sets

• Don’t know what we need to look for in each picture

Page 29:

Satellite Data

• Low-orbit fly-over every 90 minutes

• Look for differences in images

– Large objects

– Troops

– Changes to landscape

• Flag, transmit these differences immediately

• National Reconnaissance Office

• City assessors . . .

Page 30:

Washington University

Hilltop Campus

Page 31:

How do we find what we’re looking for?!

Page 32:

Conventional Structured Database

Did   Document
1     Agent James Bond
2     Agent mobile computer
3     James Madison movie
4     James Bond movie

Word       Inverted list - pointers
agent      <1,2>
Bond       <1,4>
computer   <2>
James      <1,3,4>
Madison    <3>
mobile     <2>
movie      <3,4>
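
The inverted list on this slide can be built in a few lines. This is a minimal sketch using the slide's four documents; the function names and the lowercasing of words are choices made for the example.

```python
from collections import defaultdict

# The four documents from the slide, keyed by document id (Did)
documents = {
    1: "Agent James Bond",
    2: "Agent mobile computer",
    3: "James Madison movie",
    4: "James Bond movie",
}


def build_inverted_index(docs: dict[int, str]) -> dict[str, list[int]]:
    """Map each lowercased word to the sorted ids of documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for word in set(docs[doc_id].lower().split()):
            index[word].append(doc_id)
    return dict(index)


def lookup_all(index: dict[str, list[int]], words: list[str]) -> list[int]:
    """Documents containing every query word (intersection of the lists)."""
    postings = [set(index.get(word.lower(), [])) for word in words]
    return sorted(set.intersection(*postings)) if postings else []


index = build_inverted_index(documents)
print(index["james"])                         # posting list for "James"
print(lookup_all(index, ["James", "movie"]))  # documents with both words
```

A lookup touches only the short lists for the query words, but the index has to exist, and be kept current, before the query arrives.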

Page 33:

Challenges in Searching Massive Databases

Know what to search for

– need to build index beforehand

– maintain index as it changes

Do not know what to search for

– need to search the whole database!

Page 34:

Conventional Search

[Diagram: hard drive, I/O bus, memory, memory bus, processor]

Page 35:

Conventional Search

[Diagram: hard drive, I/O bus, memory, memory bus, processor; label: find ….]

Page 36:

Conventional Search

[Diagram: hard drive, I/O bus, memory, memory bus, processor; labels: contents; yes, no, no, yes, yes ….]

Page 37:

Conventional Approach

Page 38:

WUSTL’s Approach

Page 39:

Streaming Approach

[Diagram: hard drive, reconfigurable hardware (memory/processing), I/O bus, memory, memory bus, processor]

Page 40:

Streaming Approach

[Diagram: hard drive, reconfigurable hardware (memory/processing), I/O bus, memory, memory bus, processor; label: find]

Page 42:

Streaming Approach

[Diagram: hard drive, reconfigurable hardware (memory/processing), I/O bus, memory, memory bus, processor; labels: find; yes, no, no, yes, yes]

Parallelism through each transducer and drive
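
The difference between the conventional and streaming diagrams can be pictured by counting what crosses the bus. This is a toy model, not the prototype's design: the blocks are made-up sample data, and the one-byte yes/no verdict per block is an assumption chosen for illustration.

```python
def conventional_search(blocks: list[bytes], needle: bytes):
    """Ship every block across the bus, then test it on the processor."""
    bytes_over_bus = 0
    answers = []
    for block in blocks:
        bytes_over_bus += len(block)      # full contents travel to memory
        answers.append(needle in block)   # the processor does the comparison
    return answers, bytes_over_bus


def streaming_search(blocks: list[bytes], needle: bytes):
    """Test each block next to the drive and ship only the verdict."""
    bytes_over_bus = 0
    answers = []
    for block in blocks:
        answers.append(needle in block)   # comparison happens at the disk
        bytes_over_bus += 1               # assumed: one byte per yes/no answer
    return answers, bytes_over_bus


# Hypothetical disk contents
blocks = [b"hay needle hay", b"hay hay hay", b"more hay", b"needle again", b"a needle"]

conv_answers, conv_bytes = conventional_search(blocks, b"needle")
stream_answers, stream_bytes = streaming_search(blocks, b"needle")
print(conv_answers, conv_bytes)
print(stream_answers, stream_bytes)
```

Both versions give the same answers; only the traffic differs, and in the streaming case it no longer depends on how much data sits on the disk blocks being tested.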

Page 43:

Magnetic Recording Channel Schematic

[Diagram: Input User Data, Encoder, Channel Bits, Head/Disk, Analog Readback, Detector, Decoder, Decoded User Data, To Bus or Cache; points marked A, B, C]

Page 44:

Key streaming over Data

Page 45:

Disk Level Implementation

100-bit-key matching through a pseudo-random binary series

[Plot labels: score, matches]
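
One way to read "score" and "matches" is as a sliding comparison: at each alignment, count how many key bits agree with the data, and report alignments whose score clears a threshold. The sketch below is a software illustration under that reading, not the disk-level hardware; the seed, data length, planted positions and threshold of 90 are all made up for the example.

```python
import random


def match_scores(key: list[int], data: list[int]) -> list[int]:
    """Score every alignment of the key by counting agreeing bits."""
    n = len(key)
    return [sum(k == d for k, d in zip(key, data[i:i + n]))
            for i in range(len(data) - n + 1)]


rng = random.Random(2002)                        # fixed seed, arbitrary choice
key = [rng.randint(0, 1) for _ in range(100)]    # a 100-bit key
data = [rng.randint(0, 1) for _ in range(1000)]  # pseudo-random binary series

exact_at = 400
data[exact_at:exact_at + len(key)] = key         # plant an exact copy

noisy_at = 700
data[noisy_at:noisy_at + len(key)] = key         # plant a second copy ...
for offset in (3, 41, 77):
    data[noisy_at + offset] ^= 1                 # ... and flip three of its bits

scores = match_scores(key, data)
threshold = 90                                   # hypothetical cut-off
matches = [i for i, s in enumerate(scores) if s >= threshold]
print(scores[exact_at], scores[noisy_at], matches)
```

Random alignments agree on about half the bits, so a copy with a few errors still stands well clear of the background; that tolerance is what makes the match approximate.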

Page 46:

Status: Prototype in progress

[Diagram labels: FPX (NID, RAD); hard drive; host ATAPI controller; IDE bus; tap (16-bit data, 15-bit CTRL); custom PCB for electrical termination and 5V to 3.3V conversion; 32 RAD test pins; Loopback module; IDE_to_ATM module; setup reused from FPX]

Page 47:

Internet Packet Filtering

with Mahesh Jayaram and George Varghese

Page 48:

Finding Needles in a Moving Haystack

Page 49:

As technology improves, transmission time decreases but latency stays the same

[Chart: cost of an Internet request by year, split into latency and transmission time]

Page 50:

Example: Garden Hose

Water Supply

Latency (first drop) ~ distance

Bandwidth ~ hose diameter

Fire department and gardener suffer the same wait
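
The hose analogy corresponds to a simple model: total time is the wait for the first drop plus the time to deliver the rest. A minimal sketch, with a made-up latency, request size and range of bandwidths:

```python
def request_time(latency_s: float, size_bits: float, bandwidth_bps: float) -> float:
    """Total time = latency (first drop) + transmission time (size / bandwidth)."""
    return latency_s + size_bits / bandwidth_bps


latency = 0.05        # hypothetical 50 ms wait for the first bit
size = 8_000_000      # hypothetical 1-megabyte reply, in bits

for bandwidth in (1e6, 1e7, 1e8, 1e9):   # a wider and wider hose
    total = request_time(latency, size, bandwidth)
    print(f"{bandwidth:>13,.0f} bit/s -> {total:.3f} s")
```

Widening the hose shrinks only the second term; the total can never drop below the latency.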

Page 51:

Example: Hot Shower

You want this water

Latency (time to get hot water) ~ distance

Page 52:

Latency-Free Hot Shower

Convection circuit continuously circulates hot water

Latency ~ 0

Page 53:

Better to receive than to give

• Cable broadcast

• Radio broadcast

• TV guide channel

• Gate connection announcements in flight

• Winning lottery number

Modern name: push technology

Page 54:

Better to receive than to give

Page 55:

How do you get what you want?

Page 56:

Packet Filters

Filter F(Weather)

Page 58:

Existing Approach

IBM Quote

Weather

Flight Schedule

Page 59:

Our approach

IBM Quote, Weather, Flight Schedule

Composite filter makes just one pass
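
The one-pass idea can be illustrated with a much simpler structure than the grammar-based engine described in the talk. In the sketch below, each filter is a byte prefix (an assumption made for the example), the existing approach tries the filters one after another, and the composite version merges them into a single trie that is walked once. The filter prefixes and packet contents are hypothetical.

```python
def sequential_match(filters: dict[str, bytes], packet: bytes) -> list[str]:
    """Existing approach: run each filter over the packet in turn."""
    return [name for name, prefix in filters.items() if packet.startswith(prefix)]


def build_composite(filters: dict[str, bytes]) -> dict:
    """Merge all prefixes into one trie; a node's "$" entry lists filters ending there."""
    root: dict = {}
    for name, prefix in filters.items():
        node = root
        for byte in prefix:
            node = node.setdefault(byte, {})
        node.setdefault("$", []).append(name)
    return root


def composite_match(trie: dict, packet: bytes) -> list[str]:
    """Composite filter: a single left-to-right pass over the packet."""
    matched = list(trie.get("$", []))
    node = trie
    for byte in packet:
        node = node.get(byte)
        if node is None:          # no filter can match beyond this point
            break
        matched.extend(node.get("$", []))
    return matched


# Hypothetical filters, keyed by the names on the slide
filters = {"IBM Quote": b"QUOTE IBM", "Weather": b"WEATHER", "Flight Schedule": b"FLIGHT"}
trie = build_composite(filters)

print(composite_match(trie, b"WEATHER St. Louis: sunny"))
print(composite_match(trie, b"QUOTE IBM 84.50"))
print(composite_match(trie, b"SPORTS scores"))
```

The sequential version does work proportional to the number of filters, while the trie walk is bounded by the packet's own length however many filters are merged in.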

Page 60:

How we do it

IBM Quote

Weather

Flight Schedule

Grammar 1

Grammar 2

Grammar 3

Parsing Engine

Page 61:

Sample grammar for TCP packet

TCPConnHeader  : EtherType IPHeader TCPPortPair
EtherType      : #IP_TYPE
IPHeader       : Vers HlenPlusRest
Vers           : HalfByte
HlenPlusRest   : 0 1 0 1 FixedRest
               | 0 1 1 0 FixedRest OneIPOption
               | 0 1 1 1 FixedRest TwoIPOption
               | 1 0 0 0 FixedRest ThreeIPOption
               | 1 0 0 1 FixedRest FourIPOption
               | 1 0 1 0 FixedRest FiveIPOption
               | 1 0 1 1 FixedRest FiveIPOption OneIPOption
               | 1 1 0 0 FixedRest FiveIPOption TwoIPOption
               | 1 1 0 1 FixedRest FiveIPOption ThreeIPOption
               | 1 1 1 0 FixedRest FiveIPOption FourIPOption
               | 1 1 1 1 FixedRest FiveIPOption FiveIPOption
FixedRest      : ServiceType TotalLength Identification Flags
                 FragmentOffset TimeToLive Protocol HeaderChecksum IPAddrPair
ServiceType    : Byte
TotalLength    : TwoByte
Identification : TwoByte
Flags          : bit bit bit
FragmentOffset : bit Byte HalfByte
TimeToLive     : Byte
Protocol       : #TCP_PROTOCOL
HeaderChecksum : TwoByte
IPAddrPair     : #IP_SRC_DST_PAIR
FiveIPOption   : ThreeIPOption TwoIPOption
FourIPOption   : TwoIPOption TwoIPOption
ThreeIPOption  : TwoIPOption OneIPOption
TwoIPOption    : OneIPOption OneIPOption
OneIPOption    : Option Padding
Option         : ThreeByte
Padding        : Byte
TCPPortPair    : #TCP_PORT_PAIR
FourByte       : TwoByte TwoByte
ThreeByte      : TwoByte Byte
TwoByte        : Byte Byte
Byte           : HalfByte HalfByte
HalfByte       : bit bit bit bit
bit            : 0
               | 1
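
What the grammar accepts can be checked by hand with an ordinary byte-level parser. The sketch below is not the talk's parsing engine; it is a plain walk over the same fields, assuming the input starts at the EtherType, that #IP_TYPE is 0x0800 and that #TCP_PROTOCOL is 6 (the standard values). The function names and sample addresses are made up, and `build_sample` exists only to produce test input.

```python
import struct

ETHERTYPE_IP = 0x0800   # assumed value of #IP_TYPE
PROTOCOL_TCP = 6        # assumed value of #TCP_PROTOCOL


def parse_tcp_conn_header(data: bytes):
    """Return (src, dst, src_port, dst_port), or None if the bytes do not match."""
    if len(data) < 2 + 20:                       # EtherType + shortest IP header
        return None
    (ether_type,) = struct.unpack_from("!H", data, 0)
    if ether_type != ETHERTYPE_IP:
        return None
    (vers_hlen, _tos, _total_len, _ident, _flags_frag,
     _ttl, protocol, _checksum, src, dst) = struct.unpack_from("!BBHHHBBH4s4s", data, 2)
    hlen_words = vers_hlen & 0x0F                # low half-byte: length in 32-bit words
    if hlen_words < 5 or protocol != PROTOCOL_TCP:
        return None
    ports_at = 2 + hlen_words * 4                # skip FixedRest and (hlen - 5) options
    if len(data) < ports_at + 4:
        return None
    src_port, dst_port = struct.unpack_from("!HH", data, ports_at)
    return (".".join(map(str, src)), ".".join(map(str, dst)), src_port, dst_port)


def build_sample(option_words: int = 0, protocol: int = PROTOCOL_TCP) -> bytes:
    """Build hypothetical test input with the given number of 4-byte options."""
    hlen = 5 + option_words
    ip = struct.pack("!BBHHHBBH4s4s", (4 << 4) | hlen, 0, hlen * 4 + 20, 1, 0,
                     64, protocol, 0, bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2]))
    options = b"\x01\x01\x01\x00" * option_words  # three option bytes + one padding byte
    ports = struct.pack("!HH", 12345, 80)
    return struct.pack("!H", ETHERTYPE_IP) + ip + options + ports


print(parse_tcp_conn_header(build_sample(option_words=0)))
print(parse_tcp_conn_header(build_sample(option_words=2)))
```

The header-length half-byte decides how many option words to skip, which is the job the eleven HlenPlusRest alternatives do in the grammar.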

Page 62:

Results

The more things you want, the slower existing approaches get

Our performance doesn’t degrade

Page 63:

Conclusions

• The Internet and its content are growing explosively

• Disk storage is abundant, cheap, reliable

• Technology must provide fast, inexact searching of text and images

• As more data is hurled at and past us, fast filtering of Internet traffic is a must

Page 64:

Questions?