Querying the Internet with PIER
CS294-4, Paul Burstein, 11/10/2003
Paul Burstein: PIER, 11/10/2003
Outline
- Motivation
- Architecture
- Join Algorithms
- Evaluation
- Discussion
Motivation
- Inject a degree of distribution into databases
- Internet-scale systems vs. hundred-node systems
- Large-scale applications requiring database functionality
Applications
- P2P databases: highly distributed and available data
- Network monitoring: intrusion detection, fingerprint queries
Design Principles
- Relaxed consistency: sacrifice consistency in the face of availability and partition tolerance
- Organic scaling: growth with deployment
- Natural habitats for data: data remains in its original format, with a DB interface on top
- Standard schemas: achieved through common software
PIER Architecture
DHT Design
- Implemented with CAN and Chord
- Routing layer: mapping for keys
- Storage manager: node data storage
- Provider: storage access interface for higher levels
Routing & Storage
- Routing layer: DHT-based API; locationMapChange signals a change in the local key set
- Storage manager: an easy-to-realize API; efficient performance relative to the network
- A main-memory storage manager is used
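The routing layer's core job, mapping a key to the node currently responsible for it, can be sketched as a Chord-style successor lookup on a hash ring. This is a minimal illustration, not PIER's actual API; the function names and the 32-bit ring size are assumptions:

```python
import hashlib
from bisect import bisect_right

def ring_position(key: bytes, bits: int = 32) -> int:
    # Hash the key onto a 2^bits identifier ring.
    digest = hashlib.sha1(key).digest()
    return int.from_bytes(digest[:4], "big") % (2 ** bits)

def lookup(node_ids, key: bytes) -> int:
    # The node responsible for a key is the first node at or
    # clockwise after the key's ring position (its successor).
    pos = ring_position(key)
    i = bisect_right(node_ids, pos)
    return node_ids[i % len(node_ids)]  # wrap around the ring

nodes = sorted(ring_position(f"node-{i}".encode()) for i in range(8))
owner = lookup(nodes, b"R:tuple-42")
```

When nodes join or leave, the set of keys each node is successor for shifts, which is exactly the event a locationMapChange-style callback would report.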
Provider
- Couples the routing and storage layers
- namespace – relation
- resourceId – primary key
- namespace + resourceId – DHT key
- instanceId – distinguishes objects with the same namespace and resourceId
- lifetime – item storage duration
- multicast – contacts a namespace's nodes
- lscan – iterates over a node's local data
- newData – application callback on data arrival
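The provider interface above can be sketched as follows. This is an illustrative sketch, not PIER's actual signatures: items are addressed by (namespace, resourceId, instanceId) and expire after their soft-state lifetime.

```python
import time

class Provider:
    """Sketch of a provider: couples addressing (namespace,
    resourceId, instanceId) with soft-state storage."""

    def __init__(self):
        self.store = {}      # (ns, rid, iid) -> (value, expiry)
        self.callbacks = {}  # namespace -> newData callback

    def put(self, namespace, resource_id, instance_id, value, lifetime):
        key = (namespace, resource_id, instance_id)
        self.store[key] = (value, time.time() + lifetime)
        cb = self.callbacks.get(namespace)
        if cb:
            cb(key, value)  # newData: notify the application on arrival

    def lscan(self, namespace):
        # Iterate over this node's live local items in a namespace.
        now = time.time()
        for (ns, rid, iid), (value, expiry) in list(self.store.items()):
            if ns == namespace and expiry > now:
                yield (rid, iid, value)

    def new_data(self, namespace, callback):
        self.callbacks[namespace] = callback
```

The lifetime field makes all storage soft state: if an item is not reinserted before its expiry, it silently disappears, which is what the soft-state evaluation later measures.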
PIER Query Processor
- Query dataflow engine
- Operators: selection, projection, joins, grouping, aggregation
- Operators push and pull data
- Currently, data modification is through the DHT interface
- Relaxed consistency and a reachable snapshot: works only with nodes reachable at the time a query is issued
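The pull side of the dataflow style above can be illustrated with operators as Python generators, each pulling tuples from its child. A sketch only; PIER's operators also push data, and the names here are illustrative:

```python
def scan(tuples):
    # Leaf operator: yields each locally stored tuple.
    yield from tuples

def select(child, predicate):
    # Pulls from its child and passes through matching tuples.
    for t in child:
        if predicate(t):
            yield t

def project(child, attrs):
    # Keeps only the named attributes of each tuple.
    for t in child:
        yield {a: t[a] for a in attrs}

rows = [{"src": "10.0.0.1", "port": 22},
        {"src": "10.0.0.2", "port": 80}]
plan = project(select(scan(rows), lambda t: t["port"] == 22), ["src"])
result = list(plan)  # [{"src": "10.0.0.1"}]
```

Composing operators this way gives a demand-driven pipeline: no tuple is materialized until the consumer asks for it.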
Join Algorithms
- Symmetric hash join: rehashes both relations (scan and copy)
- Fetch matches: one relation is already hashed on the join attribute
- Notation: R, S – relations; Nr, Ns – relation namespaces; Nq – DHT-based temporary table
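The symmetric hash join can be sketched as follows: each arriving tuple is inserted into its own relation's hash table and immediately probed against the other's, so results stream out as tuples arrive in either order. This is a single-node sketch; in PIER both hash tables are rehashed into the DHT namespace Nq, and the key functions are assumptions:

```python
from collections import defaultdict

def symmetric_hash_join(arrivals, key_r, key_s):
    # arrivals yields ("R", tuple) or ("S", tuple) in arrival order.
    tables = {"R": defaultdict(list), "S": defaultdict(list)}
    keys = {"R": key_r, "S": key_s}
    for rel, t in arrivals:
        k = keys[rel](t)
        tables[rel][k].append(t)          # insert into own table
        other = "S" if rel == "R" else "R"
        for m in tables[other][k]:        # probe the other table
            yield (t, m) if rel == "R" else (m, t)

arrivals = [("R", ("k1", "r1")), ("S", ("k1", "s1")),
            ("R", ("k2", "r2"))]
out = list(symmetric_hash_join(arrivals,
                               key_r=lambda t: t[0],
                               key_s=lambda t: t[0]))
```

Because neither relation needs to be fully received before probing starts, the join is non-blocking, which suits data scattered across many nodes.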
Join Rewriting
- Aimed at lowering bandwidth utilization
- Symmetric semi-join: project locally to the join keys; fetch matches globally for the full tuples
- Bloom joins: local Bloom filters are published into temporary namespaces, then multicast to the opposite relation's nodes
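The Bloom-join idea can be sketched with a tiny Bloom filter: each site summarizes its local join keys in a bit vector, and the opposite relation ships only tuples whose keys pass the filter. The filter sizes and names below are illustrative assumptions; false positives cost some extra bandwidth but never drop results:

```python
import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k  # m bits, k hash functions
        self.bits = 0

    def _positions(self, key):
        for i in range(self.k):
            h = hashlib.sha1(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        return all(self.bits >> p & 1 for p in self._positions(key))

# S's nodes publish a filter over their join keys into a temporary
# namespace; R's nodes ship only tuples that pass it.
s_filter = BloomFilter()
for key in ["k1", "k2", "k3"]:
    s_filter.add(key)

r_tuples = [("k1", "r1"), ("k9", "r9")]
to_ship = [t for t in r_tuples if s_filter.might_contain(t[0])]
```

A compact bit vector is far cheaper to multicast than the relation itself, which is the whole point of the rewrite.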
How does this scale?
Workload Parameters
- CAN configuration: d = 4
- R is 10 times larger than S
- Constants provide 50% selectivity
- f(x, y) is evaluated after the join
- 90% of R tuples match a tuple in S
- Result tuples are 1KB each
- Symmetric hash join used
Simulation Setup
- Up to 10,000 nodes
- Network cross-traffic, CPU, and memory utilization ignored
- Two topologies: (1) 100ms, 10Mbps fully connected links; (2) GT-ITM transit-stub topology
Scalability
- 1MB of data per node
- Fully connected topology
- Variable number of computation nodes
- Network congestion is an issue with few computation nodes
- How is the computation workload distributed?
Join Algorithms (1/2)
- Infinite bandwidth; 1024 data and computation nodes
- The core join algorithms perform faster than the rewrites
- Rewrite overheads – Bloom filter: two multicasts; semi-join: two CAN lookups
Join Algorithms (2/2)
- Limited bandwidth: 10Mbps inbound capacity; 25GB relations, 1024 nodes
- Symmetric hash join rehashes both tables
- Semi-join transfers only matching tuples
- At 40% selectivity, the bottleneck switches from the computation nodes to the query sites
Soft State
- Failure detection and recovery
- 15-second failure detection; 4096 nodes
- Refresh period: time to reinsert lost tuples
Transit-Stub Topology
- GT-ITM: 4 domains, 10 nodes per domain, 3 stubs per node
- 50ms, 10ms, and 2ms latencies; 10Mbps inbound links
- Similar trends to the fully connected topology, with slightly longer end-to-end delays
Experimental Results
- 64 PCs on a 1Gbps network
- All nodes are computation nodes
Discussion
- PIER presents a distributed query engine
- What remains to be done? DB issues; networking issues