Adaptively Approximate Techniques in Distributed Architectures · 2015-02-22
Barbara Catania, Giovanna Guerrini
DIBRIS - University of Genoa, Italy
Adaptively Approximate Techniques in Distributed Architectures
What are we talking about?
SOFSEM 2015 2
[Diagram: the user sends a query to the Database Management System and receives an answer. Labels: Rainer Manthey’s talk; This talk]
What are we talking about?
How to effectively and efficiently process queries in traditional and advanced data management architectures
Why and how to combine approximation and adaptivity in advanced architectures
Summary
Background and problem statement
ASAP: Approximate Search with Adaptive Processing
ASAP in the Small
ASAP in the Large
Conclusions
PART I
Background and problem statement
What are we talking about?
How to effectively and efficiently process queries in traditional and advanced data management architectures
The past, the present
The past
Reference architecture
[Diagram: static data integration and reconciliation; operations: SELECT, INSERT, DELETE, UPDATE, COMMIT/ROLLBACK]
Data
Structured data
Data source with a well-known schema

ID       Number  PartNum  Quantity  Price
ID00033  1       XY-47    14        16.80
ID00034  2       B-987    6         2.34
…        …       …        …         …
Queries
Operational data retrieval operations
Precise queries: the user expects as results all the objects that precisely meet the request
Declarative languages: SQL standard
SELECT Part_Num
FROM Catalog
WHERE Price > 10
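As a minimal sketch, the precise query above can be run with Python's built-in sqlite3 module. The in-memory Catalog table and the column name Part_Num (the sample table on the previous slide spells it PartNum) are assumptions made for illustration; the two rows are the sample rows shown earlier.

```python
import sqlite3

# Assumed schema: mirrors the sample table, with Part_Num spelled as in the query.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Catalog "
    "(ID TEXT, Number INTEGER, Part_Num TEXT, Quantity INTEGER, Price REAL)"
)
conn.executemany(
    "INSERT INTO Catalog VALUES (?, ?, ?, ?, ?)",
    [("ID00033", 1, "XY-47", 14, 16.80), ("ID00034", 2, "B-987", 6, 2.34)],
)

# Precise query: only rows that exactly satisfy Price > 10 are returned.
rows = conn.execute("SELECT Part_Num FROM Catalog WHERE Price > 10").fetchall()
print(rows)
```

Only the part priced 16.80 qualifies; the part priced 2.34 is excluded, however close a relaxed reading of the condition might consider it.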
Query processing
[Diagram: the user submits a (declarative) query to the DataBase Management System and receives a (precise) answer]
Query processing
[Diagram: inside the DataBase Management System, the query optimizer turns the (declarative) query into a compiled query plan, and the query executor runs that plan to return the (precise) answer to the user]
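The split between optimizer and executor can be observed in an existing engine. The sketch below uses SQLite (chosen here only because it ships with Python; the slides do not name a system) and its EXPLAIN QUERY PLAN statement, which reports the plan chosen by the optimizer without executing the query. The two-column Catalog schema is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Catalog (Part_Num TEXT, Price REAL)")  # assumed schema

# Ask the optimizer for the plan it compiled, without executing the query.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT Part_Num FROM Catalog WHERE Price > 10"
).fetchall()
details = [row[3] for row in plan]  # fourth column holds the plan step description
print(details)
```

With no index on Price, the reported step is a full scan of Catalog. The plan is fixed before execution starts, which is exactly the assumption that adaptive processing gives up.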
The present
[Diagram: crowdsourced data; data streams; semantic (linked) data; large-scale data distribution]
Emerging features: data
Huge (terabytes to exabytes) amount of (shared) information
Different types of data availability: stored data, stream data
Uncertainty due to data inconsistency, incompleteness, ambiguities, deception, low freshness
Heterogeneous w.r.t. structure, semantics, quality
Geo-referenced, time-variant
Emerging features: queries
Fully exploiting the potential of the huge amount of available data
Pressing need to use these data for goals beyond “routine” processing
Limited knowledge of the user about the data to be queried
Limited resources with respect to data volumes
High system dynamicity
Emerging processing modalities: approximation
Precise results are not always possible: approximation is a need, in the presence of bounded resources and high load
Precise results are not always desired: approximation (relaxation) is an opportunity for increasing user satisfaction, in the presence of
  highly heterogeneous data
  limited knowledge about data
Even when the user knows the data, she usually wants only the ‘best results’, in order to avoid flooding: best results first (preference-based queries)
Emerging processing modalities: adaptivity
Data properties may not be known or estimated a priori
Processing conditions (network load…) vary significantly over time
Importance of getting the first (good-quality) results early
It may not be possible to determine execution plans before the processing starts
Need to adapt the processing to dynamic conditions, giving up the a priori selection of a single execution strategy fixed before processing
Interleave the optimization and execution stages
[Diagram: adaptation loop with stages Measure, Analyze, Plan, Actuate]
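A minimal sketch of interleaving optimization and execution, assuming a toy setting that is not taken from the slides: a conjunction of filter predicates is applied batch by batch, and after each batch the predicates are re-ordered by their observed pass rate. The function name adaptive_filter and the batch-level granularity are illustrative choices.

```python
def adaptive_filter(batches, predicates):
    """Filter batches, re-ordering predicates after each batch (most selective first)."""
    order = list(range(len(predicates)))
    results, orders_used = [], []
    for batch in batches:
        orders_used.append(tuple(order))
        seen = [0] * len(predicates)
        passed = [0] * len(predicates)
        for item in batch:
            ok = True
            for i in order:              # Actuate: run the current plan
                seen[i] += 1             # Measure: count evaluations and passes
                if predicates[i](item):
                    passed[i] += 1
                else:
                    ok = False
                    break
            if ok:
                results.append(item)
        # Analyze: observed pass rate per predicate (1.0 if never evaluated)
        rate = [passed[i] / seen[i] if seen[i] else 1.0 for i in range(len(predicates))]
        # Plan: evaluate the most selective predicate first in the next batch
        order.sort(key=lambda i: rate[i])
    return results, orders_used


batches = [list(range(0, 10)), list(range(10, 20))]
predicates = [lambda x: x >= 0, lambda x: x % 5 == 0]
results, orders_used = adaptive_filter(batches, predicates)
print(results, orders_used)
```

The answer is the same whatever the order; only the work done changes. After the first batch the sketch moves the selective predicate (x % 5 == 0) to the front, without any statistics being available before processing started.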
Beyond traditional query processing

Reference                      Vs                                   Approximation                            Adaptivity
Traditional                    -                                    Possible                                 Possible
Data Streams                   Velocity                             Required, due to data unboundedness      Required, due to dynamicity
Large-scale Data Distribution  Volume, Variety, Velocity, Veracity  Required, due to the high heterogeneity  Required, due to elasticity
Beyond traditional query processing
Approximation
  Subject: the query processing task or the data to which approximation is applied
  Target: the information used for the approximation
Adaptivity
  Subject: the processing task affected by the adaptation
  Target: what the technique attempts to adapt
Aim: the parameter(s) to be maximized/minimized
Approximation: traditional environments
Subject: query specification (by rewriting); query specification (preference-based: top-k, skyline); processing algorithms; data reduction
Target: data distribution, structure information; ranking function, relevant attributes; similarity functions; pruning conditions, heuristics; synopsis, summaries
Aim: result relevance; throughput
Adaptivity: traditional environments
Queries over many tables
Unreliability of traditional cost estimation, mainly due to unavailable and/or out-of-date statistics about attribute correlations and skewed attribute distributions

Subject                    Target                                      Aim
Query plans/tuple routing  Data characteristics/query parameters       Throughput
Which recurrent aims?
Quality of Service (QoS) oriented techniques: aimed at coping with limited or constrained resource availability during query processing (with QoD guarantees)
Quality of Data (QoD) oriented techniques: aimed at improving the quality of result data (with QoS guarantees)
Which recurrent aims?
QoS parameters: throughput, CPU usage, memory consumption, latency, communication overhead, …
QoD parameters: accuracy, coverage, freshness, …
How to combine approximation and adaptivity?
[Diagram: grid crossing the aim of approximation (Quality of Data, Quality of Service) with the aim of adaptivity (Quality of Data, Quality of Service)]
ASAP: Approximate Search with Adaptive Processing
What is ASAP
A framework for defining QoD-oriented approximation techniques that may adaptively change, at run-time, the degree of approximation applied
In ASAP techniques, decisions concerning when, how and how much to approximate are taken dynamically, during the processing, with the goal of improving the quality of the result with efficiency guarantees
ASAP in our work
ASAP in the Small: definition of ASAP techniques for advanced architectures with a limited degree of distribution
ASAP in the Large: investigation of ASAP techniques in highly distributed architectures and emerging contexts
Moving towards a vision ...
Data Streams: ASAP in the Small
Large-scale Data Management: ASAP in the Large
PART II
ASAP in the Small
ASAP in the Small
Definition of ASAP techniques for advanced architectures with a limited degree of distribution
Data Stream Management Systems
Adaptive techniques for combining exact (fast) and approximate (accurate) relaxed queries over dynamic (stream) data
Data streams
[Diagram: data streams flowing into a Data Stream Management System]
Continuous queries
[Diagram: continuous queries registered in a Data Stream Management System; each is evaluated over a window of the stream and produces a result]
Blocking vs non-blocking operators
Operator semantics relies on approximation
SELECT *
FROM R [RANGE 5 MINUTES]
[ROWS 4]
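The two window specifications shown on the slide, [ROWS 4] and [RANGE 5 MINUTES], can be sketched as follows. This is an illustration under assumptions, not the semantics of any specific DSMS: the stream is a Python iterable, timestamps are plain seconds, and a tuple exactly 300 seconds older than the newest one is treated as expired.

```python
from collections import deque


def rows_window(stream, size):
    """Count-based window: yield the last `size` tuples after each arrival."""
    window = deque(maxlen=size)
    for item in stream:
        window.append(item)
        yield list(window)


def range_window(stream, seconds):
    """Time-based window: yield tuples whose timestamp is within `seconds` of the newest."""
    window = deque()
    for ts, value in stream:
        window.append((ts, value))
        while window[0][0] <= ts - seconds:   # expire tuples that fell out of range
            window.popleft()
        yield list(window)


rows = list(rows_window(["a", "b", "c", "d", "e"], size=4))
ranged = list(range_window([(0, "a"), (200, "b"), (400, "c")], seconds=300))
print(rows[-1], ranged[-1])
```

Because the stream is unbounded, a blocking operator such as a join or an aggregate can only be evaluated over such a finite window, which is why its result is an approximation of the result over the whole stream.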
Key features
Data unboundedness
Dynamic environment
Unknown and dynamic characteristics for data at runtime
Limited resources with respect to incoming data
Increasingly aggressive sharing of resources and computation
Approximation: adding velocity
Subject: load shedding; query specification (preference-based: top-k, skyline); data reduction
Target: drops of tuples/probes; ranking function, relevant attributes; sketches
Aim: memory consumption; CPU usage; result relevance; computability
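Load shedding, listed above with tuple drops as its target, can be sketched as random dropping once a burst exceeds what the system can process. This is a generic illustration, not a technique described in the slides; the function name shed_load, the capacity parameter and the uniform drop policy are all assumptions.

```python
import random


def shed_load(tuples, capacity, rng):
    """Randomly drop tuples so that, on average, at most `capacity` survive."""
    if len(tuples) <= capacity:
        return list(tuples)
    keep_probability = capacity / len(tuples)
    return [t for t in tuples if rng.random() < keep_probability]


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
burst = list(range(1000))
kept = shed_load(burst, capacity=100, rng=rng)
print(len(kept))
```

The approximation here serves a QoS aim (bounded memory and CPU usage); the price is paid in result quality, since dropped tuples never reach the query.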
Adaptivity: adding velocity
Subject: limited resources under fixed plans (load shedding, operator scheduling); subquery sharing; query plans/tuple routing
Target: arrival rate; workload; data characteristics, system conditions
Aim: throughput (output rate); memory consumption; accuracy; CPU usage
Relaxed Queries
Relaxation skyline queries [2006, 2012]
  For each window, only the best tuples according to (a subset of) the conditions contained in the query are returned to the user
Best tuples [2001]
  In terms of a domination relationship [2001] between tuples inside the window
  Distance with respect to query conditions
Never empty result set
Given a precise query, several relaxation skyline queries
  One for each set of query conditions to be relaxed
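A minimal sketch of a relaxation skyline over one window, assuming a hypothetical query "Price > 10 AND Quantity > 10" relaxed on both conditions. Each tuple is mapped to a vector of distances from the query conditions, and a tuple is kept when no other tuple in the window dominates it. The distance function and the sample window are invented for illustration.

```python
def dominates(a, b):
    """a dominates b if a is no farther on every condition and closer on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def relaxation_skyline(window, distances):
    """Return the window tuples whose distance vectors are not dominated by any other."""
    vectors = [distances(t) for t in window]
    return [
        t for t, v in zip(window, vectors)
        if not any(dominates(w, v) for w in vectors)
    ]


# Hypothetical query: Price > 10 AND Quantity > 10, relaxed on both conditions.
def distances(t):
    price, quantity = t
    return (max(0, 10 - price), max(0, 10 - quantity))


window = [(8, 9), (9, 4), (5, 5), (2, 9.5)]
skyline = relaxation_skyline(window, distances)
print(skyline)
```

No tuple in this window satisfies the precise query, so its answer would be empty. The relaxed query still returns three tuples; only (5, 5) is discarded, because (8, 9) is closer to the request on both conditions.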
Targeted Problem
Precise queries
  Very efficient for non-blocking operators: need a window-based execution only for blocking operators, like joins and aggregates
  May decrease user satisfaction: may lead to the empty-answer or few-answer problem
  Maximal accuracy
Relaxation skyline queries
  Execution overhead: need a specific window-based execution, even for selection
  May increase user satisfaction: avoid the empty-answer or the few-answer problem
  May decrease result accuracy
Why ASAP
QoD-oriented approximation
  Continuous relaxation skyline queries
  Minimize distance of the result tuples from the user request
QoD-oriented adaptation
  Adapting query plans, providing a good compromise between user satisfaction and efficiency
  Maximize accuracy
ASAP technique
Goal
  Moving from one (possibly relaxed) query to another, maximizing accuracy during the processing
Adaptivity
  Adaptively selecting query (execution) plans, either precise or relaxed
  Decision based on statistics, relying on already processed data and computed results, and heuristics
ASAP technique
[Diagram: precise continuous query Q; relaxed continuous queries Q1 and Q2]
1. A QoD-oriented user request
2. An accuracy measure
3. A QoD-oriented adaptive framework
QoD-oriented user request
Constraints provided by the user together with the initial request:
  𝝈𝑨𝑽𝑮: average cardinality (selectivity constraint)
  𝝅𝑴𝑨𝑿: maximal distance from the specified query conditions (precision constraint)
  𝝁: weight for selectivity and precision (trade-off constraint)
Precise continuous query annotated with specific QoD constraints
Accuracy
Q: annotated precise query
Q’: precise/relaxed query
Accuracy of Q’ with respect to Q: how far the result of Q’ is from the result of Q
A higher accuracy of Q’ implies a higher user satisfaction in obtaining the result of Q’
Three main components:
  Precision operator 𝜋, depending on 𝜋𝑀𝐴𝑋
  Selectivity operator 𝜎, depending on 𝜎𝐴𝑉𝐺
  Trade-off constraint
$\alpha = \frac{\sigma \cdot \mu}{\log_c(\pi + 1) \cdot (1 - \mu) + 1}$
QoD adaptive framework
Monitor
  Collect aggregate values, based on the query in execution:
  • selectivity
  • precision
Assessor
  Determine whether some QoD conditions are satisfied
  • 𝑠𝑒𝑙+: too many results
  • 𝑠𝑒𝑙−: too few results
  • 𝑟𝑒𝑙𝑎𝑥+: imprecise results returned
  • 𝑟𝑒𝑙𝑎𝑥−: good tuples discarded
Responder
  Based on assessor predicates, determine whether the query plan should be modified
QoD adaptive framework
[Diagram: Precise Continuous Query and Relaxed Continuous Query, connected by the conditions 𝜓0 to 𝜓3]
𝜓0 = ¬𝑠𝑒𝑙− ∨ ¬𝑟𝑒𝑙𝑎𝑥−
𝜓1 = ¬𝑠𝑒𝑙+ ∧ ¬𝑟𝑒𝑙𝑎𝑥+
𝜓2 = 𝑠𝑒𝑙− ∧ 𝑟𝑒𝑙𝑎𝑥−
𝜓3 = 𝑠𝑒𝑙+ ∨ 𝑟𝑒𝑙𝑎𝑥+
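The responder can be sketched as a two-state machine driven by the four assessor predicates. The formulas for ψ0 to ψ3 are those on the slide; which transition each one labels is an assumed reading (ψ2 switches from precise to relaxed, ψ3 switches back, ψ0 and ψ1 keep the current plan), based only on ψ0 being the negation of ψ2 and ψ1 the negation of ψ3. The function name respond and the dictionary encoding are illustrative.

```python
def respond(state, p):
    """Return the next plan ("precise" or "relaxed") given assessor predicates p."""
    psi0 = (not p["sel-"]) or (not p["relax-"])
    psi1 = (not p["sel+"]) and (not p["relax+"])
    psi2 = p["sel-"] and p["relax-"]
    psi3 = p["sel+"] or p["relax+"]
    # By De Morgan, psi0 is the negation of psi2 and psi1 the negation of psi3.
    assert psi0 == (not psi2) and psi1 == (not psi3)
    if state == "precise":
        # Assumed reading: psi2 triggers relaxation, psi0 keeps the precise plan.
        return "relaxed" if psi2 else "precise"
    # Assumed reading: psi3 triggers a return to the precise plan, psi1 keeps relaxing.
    return "precise" if psi3 else "relaxed"


quiet = {"sel+": False, "sel-": False, "relax+": False, "relax-": False}
starved = dict(quiet, **{"sel-": True, "relax-": True})
flooded = dict(quiet, **{"sel+": True})
print(respond("precise", starved), respond("relaxed", flooded))
```

Under this reading the precise plan is abandoned only when results are too few and good tuples are being discarded, while the relaxed plan is abandoned as soon as it returns too many or too imprecise results.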
Experimental results
[Plots: amortized accuracy and amortized processing time; setting: medium selectivity (50%), equal relevance for selectivity and precision]
PART III
ASAP in the Large
ASAP in the Large
User interactions with the network and its many applications generate a valuable amount of information, facts, and opinions with a great socio-economic potential
This huge wealth of information is currently being exploited much below its potential because of the difficulties in accessing data to retrieve relevant information
ASAP in the Large as
  a step towards the realization of an entity-relationship search paradigm for uncontrolled and wide information domains
  with an impact on qualitative and quantitative performance of systems for processing strongly interrelated and heterogeneous data in distributed dynamic environments
Reference architecture
[Diagram labels: Open; DBaaS; Which sources?; How?]
Data sources
Data from different sources are highly heterogeneous in terms of structure, semantic richness, and quality
Geo-referenced, time-variant, and dynamic
Information sources may contain:
  strongly related and semantically complex but relatively static data (e.g., Linked Open Data)
  unstructured data, or data with a simple and defined structure
  data dynamically generated by a multitude of diverse people (e.g., social networks, microblogs)
  highly dynamic data generated by public or private institutions linked to the territory (data streams)
Graph-shaped data model
Requests
Complex requests expressing relationships among the entities of user interest
Users are able to specify such requests only vaguely, since they cannot reasonably know the format and structure of the data encoding the relevant information
Requests may rely on user profile and request context
Examples:
  the nearest shops selling the book which my friend Luca likes
  the biography of the author of the painting I am looking at
Graph-based query languages
Approximation: adding variety and veracity
Subject: partial results; data reduction; source selection; ...
Target: data source content, quality indicators; ...
Aim: throughput; communication overhead; Quality of Data (accuracy, coverage, freshness); latency; monetary budget; ...
Adaptivity: adding variety and veracity
Subject: load balancing; computational paradigm; source selection; query specification; ...
Target: machine capabilities, workload distribution; data and system conditions (#nodes, failure rate, …); user feedback; ...
Aim: Quality of Data (accuracy, coverage, freshness); CPU utilization; throughput; communication overhead; latency; ...
Targeted problem
Processing complex requests on heterogeneous and dynamic information sources can be costly
  request interpretation
  processing on available sources deemed relevant
  aggregation of results in a consistent answer to be returned to the user
The answer may not guarantee user satisfaction
  the request could have been incorrectly interpreted
  it could have been processed on inaccurate, incomplete, unreliable data
  it could have required a processing time inadequate to the urgency of the request
User intervention helps but it is not always possible
«Vision» [DBRank 2013]
User intervention can be limited by exploiting information on:
a) user context (geo-location, needs,…) and user profile (interests, habits,…)
b) data and processing quality
c) similar requests repeated over time
«Vision»
a) Context+Profile: overcome the logic of one-size-fits-all without overloading the user with useless results
b) Quality: to distinguish trustworthy sources from lower-quality ones
a) + b) allow us [Weikum, 2011]
  to choose the level of detail of an answer according to the user background
  to prefer concise and timely answers, sacrificing the quality of the result, in the case of a user on the move or in an emergency situation
«Vision»
c) Information needs may be widespread among different users
  during or after an exceptional event (environmental emergencies or flash mob initiatives)
  users belonging to the same community
  users that are in the same place, possibly at different times
common information needs: response times and interpretation errors can be limited by taking advantage of the experience gained by prior processing of similar requests
Wearable Query (WQ)
Context information:
  spatio-temporal coordinates of the request
  its motivation
  its environment (e.g., in terms of potential interaction and urgency)
User profile (provided by the user + induced by the system):
  user background and fields of interest
Explicit request annotated with context and profile information
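One possible way to picture a Wearable Query as a data structure is sketched below. The slides do not define a concrete representation, so every class name, field name and value here (Context, Profile, urgency levels, the coordinates and date) is a hypothetical choice that only mirrors the items listed above.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Context:
    latitude: float            # spatio-temporal coordinates of the request
    longitude: float
    timestamp: datetime
    motivation: str = ""
    urgency: str = "normal"    # assumed encoding of the request environment


@dataclass
class Profile:
    background: str = ""       # provided by the user or induced by the system
    interests: list = field(default_factory=list)


@dataclass
class WearableQuery:
    request: str               # the explicit request
    context: Context
    profile: Profile


wq = WearableQuery(
    request="nearest shops selling the book which my friend Luca likes",
    context=Context(44.40, 8.93, datetime(2015, 1, 27, 10, 0),
                    motivation="gift", urgency="low"),
    profile=Profile(background="computer science", interests=["books"]),
)
print(wq.context.urgency)
```

Keeping context and profile separate from the request text lets the same explicit request be interpreted differently for, say, a user in an emergency and a user browsing at leisure.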
Enabling «Materials»
[Diagram labels: Wearable Query Processing; data quality and dynamicity indicators; user profile; request context; knowledge gained during execution; annotated with context, profile, quality and dynamicity measures]
Profiled Wearable Query Patterns (PWQPs): synthetic representations of a set of WQs processed in the past; correspondences among WQs and source portions
Source meta information
Yellow pages: source indexing on the basis of associated meta-information and represented concepts
Mappings: correspondences among different data source portions
Why ASAP
QoD-oriented approximation
  Wearable queries: explicit request annotated with context and profile information
  Minimize distance of the result from context and user information
  Maximize accuracy, taking into account metadata generated by previous WQ executions and data sources
QoD-oriented adaptation
  The space of sources is incrementally adapted to the peculiarities of the submitted requests
  Simultaneous requests are processed by incrementally adapting them to the peculiarities of the space of sources and its evolution over time
  Reduced user interaction
Several issues
Which data summaries?
Which specific data quality measures?
How to take care of geo-spatial information?
Which kind of indexing techniques?
How to manage reuse?
...
Ongoing work
Source data summaries and metadata information for Linked data and their usage in Yellow Pages
Automatic acquisition of approximate geo-spatial contexts for crowdsourced (social) data
PART IV
Conclusions
Key concepts
It is no longer possible to rely on precise queries
Two enabling concepts: approximation and adaptivity
A classification based on quality is useful: QoD, QoS
Need for combined solutions
Our group: emphasis on QoD-QoD approaches (ASAP)
ASAP in summary
ASAP is not a new concept but a specific revisiting of existing approaches, focusing on QoD parameters
Useful in specific (and more controlled) contexts, and even more relevant as the complexity of the environment and of the data sources at hand increases
ASAP in the Large is still an ongoing activity, with several open issues
Thank you!