Querying the Internet with PIER
CS294-4, Paul Burstein, 11/10/2003
Paul Burstein: PIER, 11/10/2003
Outline
- Motivation
- Architecture
- Join Algorithms
- Evaluation
- Discussion
Motivation
- Inject a degree of distribution into databases
- Internet-scale systems vs. hundred-node systems
- Large-scale applications requiring database functionality
Applications
- P2P databases: highly distributed and available data
- Network monitoring: intrusion detection, fingerprint queries
Design Principles
- Relaxed consistency: sacrifice consistency in the face of availability and partition tolerance
- Organic scaling: growth with deployment
- Natural habitats for data: data remains in its original format, with a DB interface on top
- Standard schemas: achieved through common software
PIER Architecture
DHT Design
- Implemented with CAN and Chord
- Routing layer: mapping for keys
- Storage manager: node data storage
- Provider: storage access interface for higher levels
Routing & Storage
- Routing layer: DHT-based API; locationMapChange signals a change in the local key set
- Storage manager: an easy-to-realize API; efficient performance relative to the network
- A main-memory storage manager is used
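The routing layer's core job, mapping a key to the node currently responsible for it, can be sketched as a Chord-style successor lookup on a hash ring. This is a minimal illustration, not PIER's actual API; the function names and the 32-bit ring size are assumptions:

```python
import hashlib
from bisect import bisect_right

def ring_position(key: bytes, bits: int = 32) -> int:
    # Hash the key onto a 2^bits identifier ring.
    digest = hashlib.sha1(key).digest()
    return int.from_bytes(digest[:4], "big") % (2 ** bits)

def lookup(node_ids, key: bytes) -> int:
    # The node responsible for a key is the first node at or
    # clockwise after the key's ring position (its successor).
    pos = ring_position(key)
    i = bisect_right(node_ids, pos)
    return node_ids[i % len(node_ids)]  # wrap around the ring

nodes = sorted(ring_position(f"node-{i}".encode()) for i in range(8))
owner = lookup(nodes, b"R:tuple-42")
```

When nodes join or leave, the set of keys each node is successor for shifts, which is exactly the event a locationMapChange-style callback would report.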
Provider
- Couples the routing and storage layers
- namespace – relation
- resourceId – primary key
- namespace + resourceId – DHT key
- instanceId – distinguishes objects with the same namespace and resourceId
- lifetime – item storage duration
- multicast – contacts a namespace's nodes
- lscan – iterates over a node's local data
- newData – application callback on data arrival
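The provider interface above can be sketched as follows. This is an illustrative sketch, not PIER's actual signatures: items are addressed by (namespace, resourceId, instanceId) and expire after their soft-state lifetime.

```python
import time

class Provider:
    """Sketch of a provider: couples addressing (namespace,
    resourceId, instanceId) with soft-state storage."""

    def __init__(self):
        self.store = {}      # (ns, rid, iid) -> (value, expiry)
        self.callbacks = {}  # namespace -> newData callback

    def put(self, namespace, resource_id, instance_id, value, lifetime):
        key = (namespace, resource_id, instance_id)
        self.store[key] = (value, time.time() + lifetime)
        cb = self.callbacks.get(namespace)
        if cb:
            cb(key, value)  # newData: notify the application on arrival

    def lscan(self, namespace):
        # Iterate over this node's live local items in a namespace.
        now = time.time()
        for (ns, rid, iid), (value, expiry) in list(self.store.items()):
            if ns == namespace and expiry > now:
                yield (rid, iid, value)

    def new_data(self, namespace, callback):
        self.callbacks[namespace] = callback
```

The lifetime field makes all storage soft state: if an item is not reinserted before its expiry, it silently disappears, which is what the soft-state evaluation later measures.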
PIER Query Processor
- Query dataflow engine
- Operators: selection, projection, joins, grouping, aggregation
- Operators push and pull data
- Currently, data modification is through the DHT interface
- Relaxed consistency and a reachable snapshot: works only with nodes reachable at the time a query is issued
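The pull side of the dataflow style above can be illustrated with operators as Python generators, each pulling tuples from its child. A sketch only; PIER's operators also push data, and the names here are illustrative:

```python
def scan(tuples):
    # Leaf operator: yields each locally stored tuple.
    yield from tuples

def select(child, predicate):
    # Pulls from its child and passes through matching tuples.
    for t in child:
        if predicate(t):
            yield t

def project(child, attrs):
    # Keeps only the named attributes of each tuple.
    for t in child:
        yield {a: t[a] for a in attrs}

rows = [{"src": "10.0.0.1", "port": 22},
        {"src": "10.0.0.2", "port": 80}]
plan = project(select(scan(rows), lambda t: t["port"] == 22), ["src"])
result = list(plan)  # [{"src": "10.0.0.1"}]
```

Composing operators this way gives a demand-driven pipeline: no tuple is materialized until the consumer asks for it.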
Join Algorithms
- Symmetric hash join: rehashes both relations (scan and copy)
- Fetch matches: one relation is already hashed on the join attribute
- Notation: R, S – relations; Nr, Ns – relation namespaces; Nq – DHT-based temporary table
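The symmetric hash join can be sketched as follows: each arriving tuple is inserted into its own relation's hash table and immediately probed against the other's, so results stream out as tuples arrive in either order. This is a single-node sketch; in PIER both hash tables are rehashed into the DHT namespace Nq, and the key functions are assumptions:

```python
from collections import defaultdict

def symmetric_hash_join(arrivals, key_r, key_s):
    # arrivals yields ("R", tuple) or ("S", tuple) in arrival order.
    tables = {"R": defaultdict(list), "S": defaultdict(list)}
    keys = {"R": key_r, "S": key_s}
    for rel, t in arrivals:
        k = keys[rel](t)
        tables[rel][k].append(t)          # insert into own table
        other = "S" if rel == "R" else "R"
        for m in tables[other][k]:        # probe the other table
            yield (t, m) if rel == "R" else (m, t)

arrivals = [("R", ("k1", "r1")), ("S", ("k1", "s1")),
            ("R", ("k2", "r2"))]
out = list(symmetric_hash_join(arrivals,
                               key_r=lambda t: t[0],
                               key_s=lambda t: t[0]))
```

Because neither relation needs to be fully received before probing starts, the join is non-blocking, which suits data scattered across many nodes.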
Join Rewriting
- Aimed at lowering bandwidth utilization
- Symmetric semi-join: project locally to the join keys; fetch matches globally for the full tuples
- Bloom joins: local Bloom filters are published into temporary namespaces, then multicast to the opposite relation's nodes
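The Bloom-join idea can be sketched with a tiny Bloom filter: each site summarizes its local join keys in a bit vector, and the opposite relation ships only tuples whose keys pass the filter. The filter sizes and names below are illustrative assumptions; false positives cost some extra bandwidth but never drop results:

```python
import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k  # m bits, k hash functions
        self.bits = 0

    def _positions(self, key):
        for i in range(self.k):
            h = hashlib.sha1(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        return all(self.bits >> p & 1 for p in self._positions(key))

# S's nodes publish a filter over their join keys into a temporary
# namespace; R's nodes ship only tuples that pass it.
s_filter = BloomFilter()
for key in ["k1", "k2", "k3"]:
    s_filter.add(key)

r_tuples = [("k1", "r1"), ("k9", "r9")]
to_ship = [t for t in r_tuples if s_filter.might_contain(t[0])]
```

A compact bit vector is far cheaper to multicast than the relation itself, which is the whole point of the rewrite.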
How does this scale?
Workload Parameters
- CAN configuration: d = 4
- R is 10 times larger than S
- Constants provide 50% selectivity
- f(x, y) is evaluated after the join
- 90% of R tuples match a tuple in S
- Result tuples are 1KB each
- Symmetric hash join used
Simulation Setup
- Up to 10,000 nodes
- Network cross-traffic, CPU, and memory utilization ignored
- Two topologies: (1) 100ms, 10Mbps fully connected links; (2) GT-ITM transit-stub topology
Scalability
- 1MB of data per node
- Fully connected topology
- Variable number of computation nodes
- Network congestion is an issue with few computation nodes
- How is the computation workload distributed?
Join Algorithms (1/2)
- Infinite bandwidth; 1024 data and computation nodes
- The core join algorithms perform faster than the rewrites
- Rewrite overheads – Bloom filter: two multicasts; semi-join: two CAN lookups
Join Algorithms (2/2)
- Limited bandwidth: 10Mbps inbound capacity; 25GB relations, 1024 nodes
- Symmetric hash join rehashes both tables
- Semi-join transfers only matching tuples
- At 40% selectivity, the bottleneck switches from the computation nodes to the query sites
Soft State
- Failure detection and recovery
- 15-second failure detection; 4096 nodes
- Refresh period: time to reinsert lost tuples
Transit-Stub Topology
- GT-ITM: 4 domains, 10 nodes per domain, 3 stubs per node
- 50ms, 10ms, and 2ms latencies; 10Mbps inbound links
- Similar trends to the fully connected topology, with slightly longer end-to-end delays
Experimental Results
- 64 PCs on a 1Gbps network
- All nodes are computation nodes
Discussion
- PIER presents a distributed query engine
- What remains to be done? DB issues; networking issues