TRAC: Toward Recency And Consistency Reporting in a Database with Distributed Data Sources
Jiansheng Huang
Jeffrey F. Naughton
Miron Livny
2
Motivating Scenario
In a distributed monitoring system, autonomous nodes report in at unpredictable intervals, so the state captured at the central site is always somewhat out of date and inconsistent.
3
Specific Scenario
A cluster of machines: a job can be submitted to any node, and the submit node may schedule the job to run on another machine. The state is captured in a centralized database.
4
An Example
[Diagram: machines m1 and m2 with the central site's captured "state of m1" and "state of m2". Job j is submitted to m1 ("job j received here"), which schedules it on m2 ("job j running here"). Meanwhile the central site's snapshot lags: first it has no info about job j for either machine; then it records "job j received here & scheduled to run on m2" for m1 while m2 still shows no info about job j; only later does it record "job j is running here" for m2. At several points the captured states are mutually inconsistent.]
5
Enforcing Consistency
Option 1: Do everything in distributed transactions. This won't scale to large systems and is at odds with the autonomous nature of the nodes.
Option 2: Only present the latest consistent snapshot. This can give the user very out-of-date information.
6
Problem Addressed
Question: how can we help users cope with inconsistencies in collected data while retaining the scalability and autonomy of the system?
Our answer: instead of enforcing consistency, allow inconsistency and help users interpret what they see.
Issue: how to do so efficiently, without swamping the user with too much irrelevant information?
7
Reporting Recency
A user asks: "Has the machine that m1 scheduled job j to run on started running it?"
Answer without recency reporting: NOT YET.
[Diagram: the information system at the central site holds the state of m1 ("job j received here & scheduled to run on m2"), the state of m2 ("no info about job j"), and 9,998 more machine states.]
Naïve way to report recency: m1 last reported in at ..., m2 last reported in at ..., m3 last reported in at ..., ..., m10000 last reported in at ...
Our idea is to report only: m1 last reported in at 09/12/2006 15:20, m2 last reported in at 09/11/2006 09:30, and nothing else.
9
Terminology
Data source: an abstraction for a node being monitored.
Recency timestamp: the most recent time a data source reported in.
[Diagram: data sources S streaming facts into the central RDBMS.]
10
Goals
Completeness: all "relevant" data sources are in the report.
Precision: reduce the number of "irrelevant" data sources included in the report, using efficient techniques.
11
Schema Model
lastReport table: (source id, recency ts)
Other monitored relations: (c1, c2, …, source id)
Assumption: updates from a data source can only change tuples carrying its own source id in the data-source field.
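A minimal sketch of this schema model, using an in-memory SQLite database in place of the prototype's PostgreSQL (the concrete column types are assumptions):

```python
import sqlite3

# lastReport maps each data source to its recency timestamp; every monitored
# relation (here, Activity) carries the data-source column (mach_id).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lastReport (source_id TEXT PRIMARY KEY, recency_ts TEXT)")
conn.execute("CREATE TABLE Activity (mach_id TEXT, value TEXT, event_time TEXT)")

# By the assumption above, an update from source m1 may only touch Activity
# rows whose mach_id is 'm1', plus its own lastReport row.
conn.execute("INSERT INTO lastReport VALUES ('m1', '09/12/2006 15:20')")
conn.execute("INSERT INTO Activity VALUES ('m1', 'idle', '09/12/2006 15:20')")

sources = [r[0] for r in conn.execute("SELECT source_id FROM lastReport")]
# sources == ['m1']
```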
12
Roadmap Background Definitions and Techniques Prototype and Evaluation Conclusion and Future Work
13
Definitions
Definition 1. If Q references a single relation R, we say a data source s is relevant for Q if there exists a potential tuple t for R from s such that t satisfies Q's predicates.
Definition 2. For a query Q referencing relations R1, R2, …, Rn, we say that a data source s is relevant for Q if there exist a j and a potential tuple tj for Rj from s, and for every k ≠ j a tuple tk for Rk, such that these tuples together satisfy Q's predicates. In this case we say that s is relevant for Q via Rj.
Theorem 1. No single update from an irrelevant data source can change the result of a query.
14
Example 1
Suppose we keep track of machine activities in a table called Activity. The attributes of Activity are mach_id, the activity value, and the time when the activity value became valid. We treat the machine ID as the data-source column.
Activity(mach_id, value, event_time)

Table 1: An example instance of Activity
Mach_id Value Event_time
M1 Idle 03/11/2006 20:37:46
M2 Busy 02/10/2006 18:22:01
M3 Idle 03/12/2006 10:23:05

SELECT mach_id FROM Activity WHERE mach_id IN ('m1', 'm2') AND value = 'idle';

The query result is {'m1'}. The set of relevant data sources for the query is {'m1', 'm2'}.
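Example 1 can be reproduced directly; this sketch uses SQLite in place of the prototype's PostgreSQL, with lowercase activity values assumed so the predicate matches:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Activity (mach_id TEXT, value TEXT, event_time TEXT)")
conn.executemany(
    "INSERT INTO Activity VALUES (?, ?, ?)",
    [("m1", "idle", "03/11/2006 20:37:46"),
     ("m2", "busy", "02/10/2006 18:22:01"),
     ("m3", "idle", "03/12/2006 10:23:05")])

# The query from the slide.
result = [r[0] for r in conn.execute(
    "SELECT mach_id FROM Activity "
    "WHERE mach_id IN ('m1', 'm2') AND value = 'idle'")]
# result == ['m1']; the relevant sources are m1 and m2, since a future
# update from m2 could set its value to 'idle' and add it to the result.
```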
15
Example 2
Consider a P2P system where we use a Routing table to capture neighboring relationships. Mach_id is treated as the data-source column. Activity is the same as in Example 1.
Routing(mach_id, neighbor, event_time)

Table 2: A sample instance of Routing
Mach_id Neighbor Event_time
M1 M3 03/12/2006 23:20
M2 M3 02/10/2006 03:34

Table 3: An example instance of Activity
Mach_id Value Event_time
M1 Idle 03/11/2006 20:37
M2 Busy 02/10/2006 18:22
M3 Idle 03/12/2006 10:23

SELECT A.mach_id FROM Routing R, Activity A WHERE R.mach_id = 'm1' AND R.neighbor = A.mach_id AND A.value = 'idle';

The query result is {'m3'}. The set of data sources relevant via R is {'m1'}; the set of data sources relevant via A is {'m3'}.
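The join of Example 2 can likewise be reproduced (SQLite again stands in for PostgreSQL, with lowercase values assumed):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Activity (mach_id TEXT, value TEXT, event_time TEXT)")
conn.execute("CREATE TABLE Routing (mach_id TEXT, neighbor TEXT, event_time TEXT)")
conn.executemany("INSERT INTO Activity VALUES (?, ?, ?)",
                 [("m1", "idle", "03/11/2006 20:37"),
                  ("m2", "busy", "02/10/2006 18:22"),
                  ("m3", "idle", "03/12/2006 10:23")])
conn.executemany("INSERT INTO Routing VALUES (?, ?, ?)",
                 [("m1", "m3", "03/12/2006 23:20"),
                  ("m2", "m3", "02/10/2006 03:34")])

# The two-way join from the slide.
result = [r[0] for r in conn.execute(
    "SELECT A.mach_id FROM Routing R, Activity A "
    "WHERE R.mach_id = 'm1' AND R.neighbor = A.mach_id AND A.value = 'idle'")]
# result == ['m3']: only m1's Routing row survives the equality predicate,
# and its neighbor m3 is idle.
```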
16
The Focused Method
[Diagram: the system analyzes the user query, generates a recency query from the query parts, evaluates that recency query against the lastReport table to produce the recency report, and evaluates the user query itself to produce the query result.]
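For the simplest case, a selection query with a predicate on the data-source column, the pipeline can be sketched as below. The function name and the string-level rewriting are illustrative assumptions, not the prototype's actual PL/pgSQL code:

```python
import sqlite3

def focused_recency_query(source_predicate):
    """Apply the user query's data-source predicate to lastReport,
    so only potentially relevant sources appear in the recency report."""
    return ("SELECT source_id, recency_ts FROM lastReport "
            f"WHERE {source_predicate}")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lastReport (source_id TEXT, recency_ts TEXT)")
conn.executemany("INSERT INTO lastReport VALUES (?, ?)",
                 [("m1", "09/12/2006 15:20"),
                  ("m2", "09/11/2006 09:30"),
                  ("m3", "09/10/2006 08:00")])

# For a user query whose data-source predicate is mach_id IN ('m1', 'm2'),
# the generated recency query touches only the matching lastReport rows.
report = sorted(conn.execute(
    focused_recency_query("source_id IN ('m1', 'm2')")).fetchall())
# report == [('m1', '09/12/2006 15:20'), ('m2', '09/11/2006 09:30')]
```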
17
Roadmap Background Definitions and Techniques Prototype and Evaluation Conclusion and Future Work
18
Recency Reporting Prototype
PL/pgSQL table function recencyReport: accepts a user query, evaluates it, and reports recency information.
Usage: SELECT * FROM recencyReport($$SQL TEXT$$);
Recency information includes: the name of a temporary table for exceptional relevant data sources and their recency timestamps; the name of another temporary table for the other relevant data sources and their recency timestamps; the least recent data source and its recency timestamp; the most recent data source and its recency timestamp; and a bound on the inconsistency.
19
Goals of Experiments
Our approach raises many questions:
Is it expensive to analyze user queries and generate focused recency queries?
Is it expensive to evaluate the recency query in addition to each user query?
Does the focused recency query really succeed in reducing the number of irrelevant data sources in the recency report?
Our experiments are an attempt to begin to answer these questions empirically.
20
Methods Evaluated
Naïve method: the recency query simply returns all data sources as relevant.
Focused method: with automatic generation of the recency query.
A variant of the Focused method: with a hardcoded recency query.
21
Evaluation Metrics
False positive rate: the number of irrelevant data sources reported divided by the number of relevant data sources.
Response time overhead: the time overhead of the additional recency reporting.
22
Schema and Data
lastReport, Activity, and Routing are as in the earlier examples.
Synthetic data: the total row count of Activity is fixed at 10,000,000. We vary the number of data sources and the number of rows per data source (the data ratio) inversely.
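Under that constraint, a configuration might be generated as follows; the TaoN source naming follows the test queries, but the exact generator is an assumption:

```python
# Total row count of Activity is held fixed; the data ratio (rows per data
# source) and the number of data sources vary inversely.
TOTAL_ROWS = 10_000_000

def synthetic_config(num_sources):
    """Return (source_id, rows) pairs with TOTAL_ROWS rows overall."""
    data_ratio = TOTAL_ROWS // num_sources   # rows per data source
    return [(f"Tao{s}", data_ratio) for s in range(1, num_sources + 1)]

config = synthetic_config(100_000)   # 100,000 sources x 100 rows each
```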
23
Test Queries
A selection query and two-way join queries are measured.
Q3: joins Routing and Activity with a very selective predicate on Routing.
Q4: similar to Q3, but with a non-selective predicate on Routing.
24
Overhead Comparisons: Q3
Figure 1: Q3's performance overhead for recency and consistency reporting w.r.t. data ratio and # of data sources ((data ratio) × (# of data sources) = 10,000,000).
25
High Overhead Region, Q3
Figure 2: Response times for Q3 with and without the recency report w.r.t. data ratio and # of data sources ((data ratio) × (# of data sources) = 10,000,000). The Focused method with automatic generation of the recency query is used here.
26
Overhead Comparisons, Q4
Figure 3: Q4's performance overhead for recency and consistency reporting w.r.t. data ratio and # of data sources ((data ratio) × (# of data sources) = 10,000,000).
27
Performance Evaluation Summary
The overhead for analyzing a user query and generating a recency query is insignificant.
The overhead for evaluating a recency query is insignificant, and the Focused method costs less than or the same as the naïve method unless the data ratio is very low, the query has a join, and the query is not selective on data sources.
28
False Positive Rates
Naïve method: depends on the exact number of data sources; assuming 100,000 for illustration:
Q3: fpr = (100000 − 6)/6 ≈ 16,666
Q4: fpr = 6/(100000 − 6) ≈ 0.00006
Focused methods: all 0, because the precise sets of relevant data sources are found.
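The arithmetic can be checked directly; the 100,000-source count is the slide's illustrative assumption, and the relevant-source counts come from the six-machine IN list in the test queries:

```python
# False positive rate = (# irrelevant sources reported) / (# relevant sources).
# The naive method reports all sources as relevant.
total_sources = 100_000
relevant_q3 = 6                       # Q3's IN list names six machines
relevant_q4 = total_sources - 6       # Q4's NOT IN keeps all but those six

fpr_q3 = (total_sources - relevant_q3) / relevant_q3   # ~16,666
fpr_q4 = (total_sources - relevant_q4) / relevant_q4   # ~0.00006
```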
29
Roadmap Background Definitions and Techniques Prototype and Evaluation Conclusion and Future Work
30
Conclusion
For a large-scale asynchronous system, reporting recency, rather than enforcing it, is a viable solution.
Defining "relevance" is non-trivial. Our solution: a data source is relevant if a single update from it can change the result of a query.
Evaluation on our prototype showed that our methods incur insignificant overhead in most cases and are more precise than the naïve method.
31
Future Work
Key constraints. Maintenance cost. Other definitions of relevance:
A data source is "relevant" if N updates from it may change the query result.
A data source is "relevant" if a sequence of updates from it may change the query result.
34
Example 3
Mach_id Value Event_time
M1 Busy 03/11/2006 20:37:46
M2 Busy 02/10/2006 18:22:01
M3 Busy 03/12/2006 10:23:05
Table 4: Another sample instance of Activity; same Routing table instance as in Example 2.

SELECT A.mach_id FROM Routing R, Activity A WHERE R.mach_id = 'm1' AND A.value = 'idle' AND R.neighbor = A.mach_id;

The query result is empty. The set of data sources relevant via R is Ø; the set of data sources relevant via A is {'m3'}. Therefore m1 is not relevant, yet two updates from m1 will change the query result: 1) m1's Activity value is updated to 'idle', and 2) m1 is added as a neighbor of m1 itself in Routing.
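This two-update effect can be demonstrated concretely (SQLite again stands in for PostgreSQL, with lowercase values assumed):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Activity (mach_id TEXT, value TEXT, event_time TEXT)")
conn.execute("CREATE TABLE Routing (mach_id TEXT, neighbor TEXT, event_time TEXT)")
conn.executemany("INSERT INTO Activity VALUES (?, ?, ?)",
                 [("m1", "busy", "03/11/2006 20:37:46"),
                  ("m2", "busy", "02/10/2006 18:22:01"),
                  ("m3", "busy", "03/12/2006 10:23:05")])
conn.executemany("INSERT INTO Routing VALUES (?, ?, ?)",
                 [("m1", "m3", "03/12/2006 23:20"),
                  ("m2", "m3", "02/10/2006 03:34")])

QUERY = ("SELECT A.mach_id FROM Routing R, Activity A "
         "WHERE R.mach_id = 'm1' AND A.value = 'idle' "
         "AND R.neighbor = A.mach_id")

before = conn.execute(QUERY).fetchall()   # empty: every machine is busy

# Two updates from m1 (the event_time of the new row is illustrative):
conn.execute("UPDATE Activity SET value = 'idle' WHERE mach_id = 'm1'")
conn.execute("INSERT INTO Routing VALUES ('m1', 'm1', '03/13/2006 00:00')")

after = [r[0] for r in conn.execute(QUERY)]
# after == ['m1'], although m1 was not relevant under Definition 2.
```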
35
Reporting Recency
The recency information of the relevant data sources is stored in a session-duration temporary table.
It is possible that even the number of relevant sources will be large, so we provide additional summary information: the minimum recency timestamp, the maximum recency timestamp, and the range of recency.
Also, since machine failures are common in Condor, we report exceptional data sources using z-score outlier detection.
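A minimal sketch of that z-score rule; the threshold value is an assumption (the slide does not specify one), and timestamps are taken as epoch seconds:

```python
import statistics

def exceptional_sources(recency, z_threshold=2.0):
    """Return sources whose recency timestamp is a z-score outlier,
    e.g. machines that stopped reporting in long ago."""
    ts = list(recency.values())
    mean = statistics.mean(ts)
    sd = statistics.pstdev(ts)
    if sd == 0:
        return []          # all sources equally recent: no outliers
    return [s for s, t in recency.items() if abs(t - mean) / sd > z_threshold]

# Nine sources reported in around t=100; one stopped reporting long ago.
recency = {f"m{i}": 100 for i in range(1, 10)}
recency["m10"] = 10
stale = exceptional_sources(recency)
# stale == ['m10']
```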
36
Experiment Setup
System: Tao Linux 1.0, 2.4 GHz Intel, 512 MB memory. Database: PostgreSQL 8.0.0. Shared buffer pool: 8 MB. Working memory size: 1 MB. Each query is run 11 times, and the average time of the last 10 runs is used to minimize fluctuation.
37
Test Queries
Q1: SELECT COUNT(*) FROM Activity A WHERE A.mach_id IN ('Tao1','Tao10','Tao100','Tao1000','Tao10000','Tao100000') AND A.value = 'idle';
Q2: SELECT COUNT(*) FROM Activity A WHERE A.mach_id NOT IN ('Tao1','Tao10','Tao100','Tao1000','Tao10000','Tao100000') AND A.value = 'idle';
Q3: SELECT COUNT(*) FROM Routing R, Activity A WHERE R.mach_id IN ('Tao1','Tao10','Tao100','Tao1000','Tao10000','Tao100000') AND R.neighbor = A.mach_id AND A.value = 'idle';
Q4: SELECT COUNT(*) FROM Routing R, Activity A WHERE R.mach_id NOT IN ('Tao1','Tao10','Tao100','Tao1000','Tao10000','Tao100000') AND R.neighbor = A.mach_id AND A.value = 'idle';
38
Roadmap Background Definitions and Techniques Prototype and Evaluation Conclusion and Future Work Related Work
39
Enforcing Currency and Consistency
R. Alonso et al. Quasi-copies: Efficient data sharing for information retrieval systems. In EDBT, pages 443-468, 1988.
H. Garcia-Molina et al. Read-only transactions in a distributed database. ACM Trans. Database Syst., 7(2):209-234, 1982.
R. Lenz. Adaptive distributed data management with weak consistent replicated data. In SAC, pages 178-185, 1996.
A. Segev and W. Fang. Currency-based updates to distributed materialized views. In ICDE, pages 512-520, 1990.
A. Labrinidis et al. Balancing performance and data freshness in web database servers. In VLDB, pages 393-404, 2003.
L. Bright et al. Using latency-recency profiles for data delivery on the web. In VLDB, pages 550-561, 2002.
H. Guo et al. Relaxed currency and consistency: How to say "good enough" in SQL. In SIGMOD Conference, pages 815-826, 2004.
A common theme here is to enforce recency constraints through a combination of choosing the correct version of an object to query (i.e., the cached or the primary copy) or refreshing "stale" objects by synchronously "pulling" new data in response to a query.
40
Data Lineage
Y. Cui et al. Lineage tracing for general data warehouse transformations. In VLDB, pages 471-480, 2001.
This work identifies the set of source data items that produced a view item. Our work differs in that even if a data source doesn't contribute any lineage data items (possibly due to latency in reporting in, or some error), it may still be "relevant".
41
Distributed Query Processing
R. Munz et al. Application of sub-predicate tests in database systems. In A. L. Furtado and H. L. Morgan, editors, VLDB, pages 426-435, 1979.
Their problem statement relies on data items being placed in a distributed environment in such a way that they satisfy various predicates; the interaction of the data-placement predicates and the query predicates is then used to identify where data satisfying a simple query might be located.
42
Partition Pruning
D. J. DeWitt et al. Gamma: a high performance dataflow database machine. In VLDB, pages 228-237, 1986.
IBM. DB2 UDB for z/OS Version 8 Performance Topics, 2005.
Oracle Corporation. Oracle Database Concepts, 10g Release 1, 2003.
These systems choose partitions by matching certain types of selection predicates with the "partitioning predicates".