TRAC: Toward Recency And Consistency Reporting in a Database with Distributed Data Sources
Jiansheng Huang
Jeffrey F. Naughton
Miron Livny
2
Motivating Scenario
In a distributed monitoring system, autonomous nodes report in at unpredictable intervals, so the state captured at the central site is always somewhat out of date and inconsistent.
3
Specific Scenario
A cluster of machines: a job can be submitted to any node, and the submit node may schedule the job to run on another machine. The state is captured in a centralized database.
4
An Example
[Diagram: machines m1 and m2 with the central site's captured "state of m1" and "state of m2". Job j is submitted to m1 ("job j received here"), which schedules it on m2 ("job j running here"). Meanwhile the central site's snapshot lags: first it has no info about job j for either machine; then it records "job j received here & scheduled to run on m2" for m1 while m2 still shows no info about job j; only later does it record "job j is running here" for m2. At several points the captured states are mutually inconsistent.]
5
Enforcing Consistency
Option 1: Do everything in distributed transactions. This won't scale to large systems and is at odds with the autonomous nature of the nodes.
Option 2: Only present the latest consistent snapshot. This can give the user very out-of-date information.
6
Problem Addressed
Question: how can we help users cope with inconsistencies in collected data while retaining the scalability and autonomy of the system?
Our answer: instead of enforcing consistency, allow inconsistency and help users interpret what they see.
Issue: how to do so efficiently, without swamping the user with too much irrelevant information?
7
Reporting Recency
A user asks: "Has the machine that m1 scheduled job j to run on started running it?"
Answer without recency reporting: NOT YET.
[Diagram: the information system at the central site holds the state of m1 ("job j received here & scheduled to run on m2"), the state of m2 ("no info about job j"), and 9,998 more machine states.]
Naïve way to report recency: m1 last reported in at ..., m2 last reported in at ..., m3 last reported in at ..., ..., m10000 last reported in at ...
Our idea is to report only: m1 last reported in at 09/12/2006 15:20, m2 last reported in at 09/11/2006 09:30, and nothing else.
9
Terminology
Data source: an abstraction for a node being monitored.
Recency timestamp: the most recent time a data source reported in.
[Diagram: data sources S streaming facts into the central RDBMS.]
10
Goals
Completeness: all "relevant" data sources are in the report.
Precision: reduce the number of "irrelevant" data sources included in the report, using efficient techniques.
11
Schema Model
lastReport table: (source id, recency ts)
Other monitored relations: (c1, c2, …, source id)
Assumption: updates from a data source can only change tuples carrying its own source id in the data-source field.
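A minimal sketch of this schema model, using an in-memory SQLite database in place of the prototype's PostgreSQL (the concrete column types are assumptions):

```python
import sqlite3

# lastReport maps each data source to its recency timestamp; every monitored
# relation (here, Activity) carries the data-source column (mach_id).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lastReport (source_id TEXT PRIMARY KEY, recency_ts TEXT)")
conn.execute("CREATE TABLE Activity (mach_id TEXT, value TEXT, event_time TEXT)")

# By the assumption above, an update from source m1 may only touch Activity
# rows whose mach_id is 'm1', plus its own lastReport row.
conn.execute("INSERT INTO lastReport VALUES ('m1', '09/12/2006 15:20')")
conn.execute("INSERT INTO Activity VALUES ('m1', 'idle', '09/12/2006 15:20')")

sources = [r[0] for r in conn.execute("SELECT source_id FROM lastReport")]
# sources == ['m1']
```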
12
Roadmap Background Definitions and Techniques Prototype and Evaluation Conclusion and Future Work
13
Definitions
Definition 1. If Q references a single relation R, we say a data source s is relevant for Q if there exists a potential tuple t for R from s such that t satisfies Q's predicates.
Definition 2. For a query Q referencing relations R1, R2, …, Rn, we say that a data source s is relevant for Q if there exist a j and a potential tuple tj for Rj from s, and for every k ≠ j a tuple tk for Rk, such that these tuples together satisfy Q's predicates. In this case we say that s is relevant for Q via Rj.
Theorem 1. No single update from an irrelevant data source can change the result of a query.
14
Example 1
Suppose we keep track of machine activities in a table called Activity. The attributes of Activity are mach_id, the activity value, and the time when the activity value became valid. We treat the machine ID as the data-source column.
Activity(mach_id, value, event_time)

Table 1: An example instance of Activity
Mach_id Value Event_time
M1 Idle 03/11/2006 20:37:46
M2 Busy 02/10/2006 18:22:01
M3 Idle 03/12/2006 10:23:05

SELECT mach_id FROM Activity WHERE mach_id IN ('m1', 'm2') AND value = 'idle';

The query result is {'m1'}. The set of relevant data sources for the query is {'m1', 'm2'}.
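Example 1 can be reproduced directly; this sketch uses SQLite in place of the prototype's PostgreSQL, with lowercase activity values assumed so the predicate matches:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Activity (mach_id TEXT, value TEXT, event_time TEXT)")
conn.executemany(
    "INSERT INTO Activity VALUES (?, ?, ?)",
    [("m1", "idle", "03/11/2006 20:37:46"),
     ("m2", "busy", "02/10/2006 18:22:01"),
     ("m3", "idle", "03/12/2006 10:23:05")])

# The query from the slide.
result = [r[0] for r in conn.execute(
    "SELECT mach_id FROM Activity "
    "WHERE mach_id IN ('m1', 'm2') AND value = 'idle'")]
# result == ['m1']; the relevant sources are m1 and m2, since a future
# update from m2 could set its value to 'idle' and add it to the result.
```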
15
Example 2
Consider a P2P system where we use a Routing table to capture neighboring relationships. Mach_id is treated as the data-source column. Activity is the same as in Example 1.
Routing(mach_id, neighbor, event_time)

Table 2: A sample instance of Routing
Mach_id Neighbor Event_time
M1 M3 03/12/2006 23:20
M2 M3 02/10/2006 03:34

Table 3: An example instance of Activity
Mach_id Value Event_time
M1 Idle 03/11/2006 20:37
M2 Busy 02/10/2006 18:22
M3 Idle 03/12/2006 10:23

SELECT A.mach_id FROM Routing R, Activity A WHERE R.mach_id = 'm1' AND R.neighbor = A.mach_id AND A.value = 'idle';

The query result is {'m3'}. The set of data sources relevant via R is {'m1'}; the set of data sources relevant via A is {'m3'}.
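The join of Example 2 can likewise be reproduced (SQLite again stands in for PostgreSQL, with lowercase values assumed):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Activity (mach_id TEXT, value TEXT, event_time TEXT)")
conn.execute("CREATE TABLE Routing (mach_id TEXT, neighbor TEXT, event_time TEXT)")
conn.executemany("INSERT INTO Activity VALUES (?, ?, ?)",
                 [("m1", "idle", "03/11/2006 20:37"),
                  ("m2", "busy", "02/10/2006 18:22"),
                  ("m3", "idle", "03/12/2006 10:23")])
conn.executemany("INSERT INTO Routing VALUES (?, ?, ?)",
                 [("m1", "m3", "03/12/2006 23:20"),
                  ("m2", "m3", "02/10/2006 03:34")])

# The two-way join from the slide.
result = [r[0] for r in conn.execute(
    "SELECT A.mach_id FROM Routing R, Activity A "
    "WHERE R.mach_id = 'm1' AND R.neighbor = A.mach_id AND A.value = 'idle'")]
# result == ['m3']: only m1's Routing row survives the equality predicate,
# and its neighbor m3 is idle.
```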
16
The Focused Method
[Diagram: the system analyzes the user query, generates a recency query from the query parts, evaluates that recency query against the lastReport table to produce the recency report, and evaluates the user query itself to produce the query result.]
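For the simplest case, a selection query with a predicate on the data-source column, the pipeline can be sketched as below. The function name and the string-level rewriting are illustrative assumptions, not the prototype's actual PL/pgSQL code:

```python
import sqlite3

def focused_recency_query(source_predicate):
    """Apply the user query's data-source predicate to lastReport,
    so only potentially relevant sources appear in the recency report."""
    return ("SELECT source_id, recency_ts FROM lastReport "
            f"WHERE {source_predicate}")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lastReport (source_id TEXT, recency_ts TEXT)")
conn.executemany("INSERT INTO lastReport VALUES (?, ?)",
                 [("m1", "09/12/2006 15:20"),
                  ("m2", "09/11/2006 09:30"),
                  ("m3", "09/10/2006 08:00")])

# For a user query whose data-source predicate is mach_id IN ('m1', 'm2'),
# the generated recency query touches only the matching lastReport rows.
report = sorted(conn.execute(
    focused_recency_query("source_id IN ('m1', 'm2')")).fetchall())
# report == [('m1', '09/12/2006 15:20'), ('m2', '09/11/2006 09:30')]
```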
17
Roadmap Background Definitions and Techniques Prototype and Evaluation Conclusion and Future Work
18
Recency Reporting Prototype
PL/pgSQL table function recencyReport: accepts a user query, evaluates it, and reports recency information.
Usage: SELECT * FROM recencyReport($$SQL TEXT$$);
Recency information includes: the name of a temporary table for exceptional relevant data sources and their recency timestamps; the name of another temporary table for the other relevant data sources and their recency timestamps; the least recent data source and its recency timestamp; the most recent data source and its recency timestamp; and a bound on the inconsistency.
19
Goals of Experiments
Our approach raises many questions:
Is it expensive to analyze user queries and generate focused recency queries?
Is it expensive to evaluate the recency query in addition to each user query?
Does the focused recency query really succeed in reducing the number of irrelevant data sources in the recency report?
Our experiments are an attempt to begin to answer these questions empirically.
20
Methods Evaluated
Naïve method: the recency query simply returns all data sources as relevant.
Focused method: with automatic generation of the recency query.
A variant of the Focused method: with a hardcoded recency query.
21
Evaluation Metrics
False positive rate: the number of irrelevant data sources reported divided by the number of relevant data sources.
Response time overhead: the time overhead of the additional recency reporting.
22
Schema and Data
lastReport, Activity, and Routing are as in the earlier examples.
Synthetic data: the total row count of Activity is fixed at 10,000,000. We vary the number of data sources and the number of rows per data source (the data ratio) inversely.
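Under that constraint, a configuration might be generated as follows; the TaoN source naming follows the test queries, but the exact generator is an assumption:

```python
# Total row count of Activity is held fixed; the data ratio (rows per data
# source) and the number of data sources vary inversely.
TOTAL_ROWS = 10_000_000

def synthetic_config(num_sources):
    """Return (source_id, rows) pairs with TOTAL_ROWS rows overall."""
    data_ratio = TOTAL_ROWS // num_sources   # rows per data source
    return [(f"Tao{s}", data_ratio) for s in range(1, num_sources + 1)]

config = synthetic_config(100_000)   # 100,000 sources x 100 rows each
```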
23
Test Queries
A selection query and two-way join queries are measured.
Q3: joins Routing and Activity with a very selective predicate on Routing.
Q4: similar to Q3, but with a non-selective predicate on Routing.
24
Overhead Comparisons: Q3
Figure 1: Q3's performance overhead for recency and consistency reporting w.r.t. data ratio and # of data sources ((data ratio) × (# of data sources) = 10,000,000).
25
High Overhead Region, Q3
Figure 2: Response times for Q3 with and without the recency report w.r.t. data ratio and # of data sources ((data ratio) × (# of data sources) = 10,000,000). The Focused method with automatic generation of the recency query is used here.
26
Overhead Comparisons, Q4
Figure 3: Q4's performance overhead for recency and consistency reporting w.r.t. data ratio and # of data sources ((data ratio) × (# of data sources) = 10,000,000).
27
Performance Evaluation Summary
The overhead for analyzing a user query and generating a recency query is insignificant.
The overhead for evaluating a recency query is insignificant, and the Focused method costs less than or the same as the naïve method unless the data ratio is very low, the query has a join, and the query is not selective on data sources.
28
False Positive Rates
Naïve method: depends on the exact number of data sources; assuming 100,000 for illustration:
Q3: fpr = (100000 − 6)/6 ≈ 16,666
Q4: fpr = 6/(100000 − 6) ≈ 0.00006
Focused methods: all 0, because the precise sets of relevant data sources are found.
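The arithmetic can be checked directly; the 100,000-source count is the slide's illustrative assumption, and the relevant-source counts come from the six-machine IN list in the test queries:

```python
# False positive rate = (# irrelevant sources reported) / (# relevant sources).
# The naive method reports all sources as relevant.
total_sources = 100_000
relevant_q3 = 6                       # Q3's IN list names six machines
relevant_q4 = total_sources - 6       # Q4's NOT IN keeps all but those six

fpr_q3 = (total_sources - relevant_q3) / relevant_q3   # ~16,666
fpr_q4 = (total_sources - relevant_q4) / relevant_q4   # ~0.00006
```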
29
Roadmap Background Definitions and Techniques Prototype and Evaluation Conclusion and Future Work
30
Conclusion
For a large-scale asynchronous system, reporting recency, rather than enforcing it, is a viable solution.
Defining "relevance" is non-trivial. Our solution: a data source is relevant if a single update from it can change the result of a query.
Evaluation on our prototype showed that our methods incur insignificant overhead in most cases and are more precise than the naïve method.
31
Future Work
Key constraints. Maintenance cost. Other definitions of relevance:
A data source is "relevant" if N updates from it may change the query result.
A data source is "relevant" if a sequence of updates from it may change the query result.
34
Example 3
Mach_id Value Event_time
M1 Busy 03/11/2006 20:37:46
M2 Busy 02/10/2006 18:22:01
M3 Busy 03/12/2006 10:23:05
Table 4: Another sample instance of Activity; same Routing table instance as in Example 2.

SELECT A.mach_id FROM Routing R, Activity A WHERE R.mach_id = 'm1' AND A.value = 'idle' AND R.neighbor = A.mach_id;

The query result is empty. The set of data sources relevant via R is Ø; the set of data sources relevant via A is {'m3'}. Therefore m1 is not relevant, yet two updates from m1 will change the query result: 1) m1's Activity value is updated to 'idle', and 2) m1 is added as a neighbor of m1 itself in Routing.
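This two-update effect can be demonstrated concretely (SQLite again stands in for PostgreSQL, with lowercase values assumed):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Activity (mach_id TEXT, value TEXT, event_time TEXT)")
conn.execute("CREATE TABLE Routing (mach_id TEXT, neighbor TEXT, event_time TEXT)")
conn.executemany("INSERT INTO Activity VALUES (?, ?, ?)",
                 [("m1", "busy", "03/11/2006 20:37:46"),
                  ("m2", "busy", "02/10/2006 18:22:01"),
                  ("m3", "busy", "03/12/2006 10:23:05")])
conn.executemany("INSERT INTO Routing VALUES (?, ?, ?)",
                 [("m1", "m3", "03/12/2006 23:20"),
                  ("m2", "m3", "02/10/2006 03:34")])

QUERY = ("SELECT A.mach_id FROM Routing R, Activity A "
         "WHERE R.mach_id = 'm1' AND A.value = 'idle' "
         "AND R.neighbor = A.mach_id")

before = conn.execute(QUERY).fetchall()   # empty: every machine is busy

# Two updates from m1 (the event_time of the new row is illustrative):
conn.execute("UPDATE Activity SET value = 'idle' WHERE mach_id = 'm1'")
conn.execute("INSERT INTO Routing VALUES ('m1', 'm1', '03/13/2006 00:00')")

after = [r[0] for r in conn.execute(QUERY)]
# after == ['m1'], although m1 was not relevant under Definition 2.
```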
35
Reporting Recency
The recency information of the relevant data sources is stored in a session-duration temporary table.
It is possible that even the number of relevant sources will be large, so we provide additional summary information: the minimum recency timestamp, the maximum recency timestamp, and the range of recency.
Also, since machine failures are common in Condor, we report exceptional data sources using z-score outlier detection.
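A minimal sketch of that z-score rule; the threshold value is an assumption (the slide does not specify one), and timestamps are taken as epoch seconds:

```python
import statistics

def exceptional_sources(recency, z_threshold=2.0):
    """Return sources whose recency timestamp is a z-score outlier,
    e.g. machines that stopped reporting in long ago."""
    ts = list(recency.values())
    mean = statistics.mean(ts)
    sd = statistics.pstdev(ts)
    if sd == 0:
        return []          # all sources equally recent: no outliers
    return [s for s, t in recency.items() if abs(t - mean) / sd > z_threshold]

# Nine sources reported in around t=100; one stopped reporting long ago.
recency = {f"m{i}": 100 for i in range(1, 10)}
recency["m10"] = 10
stale = exceptional_sources(recency)
# stale == ['m10']
```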
36
Experiment Setup
System: Tao Linux 1.0, 2.4 GHz Intel, 512 MB memory. Database: PostgreSQL 8.0.0. Shared buffer pool: 8 MB. Working memory size: 1 MB. Each query is run 11 times, and the average time of the last 10 runs is used to minimize fluctuation.
37
Test Queries
Q1: SELECT COUNT(*) FROM Activity A WHERE A.mach_id IN ('Tao1','Tao10','Tao100','Tao1000','Tao10000','Tao100000') AND A.value = 'idle';
Q2: SELECT COUNT(*) FROM Activity A WHERE A.mach_id NOT IN ('Tao1','Tao10','Tao100','Tao1000','Tao10000','Tao100000') AND A.value = 'idle';
Q3: SELECT COUNT(*) FROM Routing R, Activity A WHERE R.mach_id IN ('Tao1','Tao10','Tao100','Tao1000','Tao10000','Tao100000') AND R.neighbor = A.mach_id AND A.value = 'idle';
Q4: SELECT COUNT(*) FROM Routing R, Activity A WHERE R.mach_id NOT IN ('Tao1','Tao10','Tao100','Tao1000','Tao10000','Tao100000') AND R.neighbor = A.mach_id AND A.value = 'idle';
38
Roadmap Background Definitions and Techniques Prototype and Evaluation Conclusion and Future Work Related Work
39
Enforcing Currency and Consistency
R. Alonso et al. Quasi-copies: Efficient data sharing for information retrieval systems. In EDBT, pages 443-468, 1988.
H. Garcia-Molina et al. Read-only transactions in a distributed database. ACM Trans. Database Syst., 7(2):209-234, 1982.
R. Lenz. Adaptive distributed data management with weak consistent replicated data. In SAC, pages 178-185, 1996.
A. Segev and W. Fang. Currency-based updates to distributed materialized views. In ICDE, pages 512-520, 1990.
A. Labrinidis et al. Balancing performance and data freshness in web database servers. In VLDB, pages 393-404, 2003.
L. Bright et al. Using latency-recency profiles for data delivery on the web. In VLDB, pages 550-561, 2002.
H. Guo et al. Relaxed currency and consistency: How to say "good enough" in SQL. In SIGMOD Conference, pages 815-826, 2004.
A common theme here is to enforce recency constraints through a combination of choosing the correct version of an object to query (i.e., the cached or the primary copy) or refreshing "stale" objects by synchronously "pulling" new data in response to a query.
40
Data Lineage
Y. Cui et al. Lineage tracing for general data warehouse transformations. In VLDB, pages 471-480, 2001.
This work identifies the set of source data items that produced a view item. Our work differs in that even if a data source doesn't contribute any lineage data items (possibly due to latency in reporting in, or some error), it may still be "relevant".
41
Distributed Query Processing
R. Munz et al. Application of sub-predicate tests in database systems. In A. L. Furtado and H. L. Morgan, editors, VLDB, pages 426-435, 1979.
Their problem statement relies on data items being placed in a distributed environment in such a way that they satisfy various predicates; the interaction of the data-placement predicates and the query predicates is then used to identify where data satisfying a simple query might be located.
42
Partition Pruning
D. J. DeWitt et al. Gamma: a high performance dataflow database machine. In VLDB, pages 228-237, 1986.
IBM. DB2 UDB for z/OS Version 8 Performance Topics, 2005.
Oracle Corporation. Oracle Database Concepts, 10g Release 1, 2003.
These systems choose partitions by matching certain types of selection predicates with the "partitioning predicates".