Bubbles – Virtual Data Objects
June 2013, Stefan Urbanek
data brewery
Contents
■ Data Objects
■ Operations
■ Context
■ Stores
■ Pipeline
Brewery 1 Issues
■ based on streaming data by records
  buffering in Python lists as Python objects
■ stream networks were using threads
  hard to debug, performance penalty (GIL)
■ no use of native data operations
■ difficult to extend
About
Python framework for data processing and quality probing
Python 3.3
Objective
focus on the process, not data technology
Data
■ keep data in their original form
■ use native operations if possible
■ performance provided by technology
■ have other options
for categorical data*
* you can do numerical too, but there are plenty of other, better tools for that
Data Objects
data object represents structured data
Data does not have to be in its final form,
nor does it even have to exist yet. A promise
of providing data in the future is just fine.
Data is virtual.
[diagram: a virtual data object is defined by fields (id, product, category,
amount, unit price) and backed by representations of virtual data, such as
a SQL statement or an iterator]
Data Object
■ is defined by fields
■ has one or more representations
  e.g. SQL statement, iterator
■ might be consumable
  one-time use objects such as streamed data
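The bullets above can be sketched as a tiny class. This is illustrative only, not the actual Bubbles implementation; the names `IteratorDataObject`, `representations()` and `rows()` follow the slides' vocabulary.

```python
# A minimal sketch of a "data object": defined by fields, exposing one
# or more representations. This one is backed by a plain Python iterable.

class IteratorDataObject:
    """Data object backed by a list of rows (its only representation)."""

    def __init__(self, fields, data):
        self.fields = fields      # fields define the structure
        self._data = list(data)

    def representations(self):
        # this object can only hand out rows; a SQL-backed object would
        # list "sql" first as its natural, most efficient representation
        return ["rows"]

    def rows(self):
        return iter(self._data)

obj = IteratorDataObject(
    fields=["id", "product", "amount"],
    data=[(100, "Atari 1040ST", 10)],
)
print(obj.representations())   # ['rows']
```

A SQL-backed object would keep the same interface but return `["sql", "rows"]`, letting operations pick the native representation first.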
Fields
■ define structure of data object
■ storage metadata
  generalized storage type, concrete storage type
■ usage metadata
  purpose – analytical point of view, missing values, ...
Field List

field name:                 id        product       category  amount    unit price  year     shipped
storage type:               integer   string        string    integer   float       integer  string
analytical type (purpose):  typeless  nominal       nominal   discrete  measure     ordinal  flag
sample row:                 100       Atari 1040ST  computer  10        400.0       1985     no
Representations
■ SQL statement – can be composed:
  SELECT * FROM products WHERE price < 100
■ iterator – actual rows fetched from the database:
  engine.execute(statement)
Representations
■ represent actual data in some way
  SQL statement, CSV file, API query, iterator, ...
■ decided at runtime
  list might be dynamic, based on metadata, availability, …
■ used for data object operations
  filtering, composition, transformation, …
Representations
■ SQL statement – natural, most efficient for operations
■ iterator – default, all-purpose, might be very expensive
Representations
>>> object.representations()
["sql_table", "postgres+sql", "sql", "rows"]

"sql_table" – data might have been cached in a table
"postgres+sql" – we might use PostgreSQL dialect-specific features …
"sql" – … or fall back to generic SQL
"rows" – for all other operations
Data Object Role
■ source: provides data
  various source representations such as rows()
■ target: consumes data
  append(row), append_from(object), ...

target.append_from(source)

for row in source.rows():
    print(row)

implementation of append_from() might depend on the source
Append From ...

target.append_from(source)

Iterator ⇢ SQL:
    for row in source.rows():
        INSERT INTO target (...)

SQL ⇢ SQL (same engine):
    INSERT INTO target SELECT … FROM source
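The two strategies above can be made runnable with sqlite3 standing in for a real backend. The function `append_from` here is a sketch of the idea, not the Bubbles API: same connection composes one native statement, otherwise it falls back to iterating rows.

```python
# Sketch of the two append_from strategies, using sqlite3.
import sqlite3

def append_from(target_conn, source_conn, target, source):
    if target_conn is source_conn:
        # SQL -> SQL, same engine: compose a single native statement
        target_conn.execute(f"INSERT INTO {target} SELECT * FROM {source}")
    else:
        # iterator -> SQL: fetch rows and insert them one by one
        for row in source_conn.execute(f"SELECT * FROM {source}"):
            placeholders = ", ".join("?" * len(row))
            target_conn.execute(
                f"INSERT INTO {target} VALUES ({placeholders})", row)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (id INTEGER)")
conn.execute("CREATE TABLE target (id INTEGER)")
conn.executemany("INSERT INTO source VALUES (?)", [(1,), (2,)])

append_from(conn, conn, "target", "source")   # native path taken
print(conn.execute("SELECT COUNT(*) FROM target").fetchone()[0])  # 2
```

With two different connections, the same call transparently takes the row-iteration path instead.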
Operations

Operation

does something useful with a data object and produces another data object
… or something else, also useful
Signature

@operation("sql")
def sample(context, object, limit):
    ...

signature – accepted representation
@operation

unary:
@operation("sql")
def sample(context, object, limit):
    ...

binary:
@operation("sql", "sql")
def new_rows(context, target, source):
    ...

binary with same name but different signature:
@operation("sql", "rows", name="new_rows")
def new_rows_iter(context, target, source):
    ...
List of Objects

@operation("sql[]")
def append(context, objects):
    ...

@operation("rows[]")
def append(context, objects):
    ...

matches one of the common representations of all objects in the list
Any / Default

@operation("*")
def do_something(context, object):
    ...

default operation – used if no signature matches
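A decorator like the one in these slides can be sketched in a few lines: it registers each function in a registry keyed by operation name and signature. The registry layout is an assumption for illustration, not the Bubbles internals.

```python
# Minimal sketch of an @operation decorator that registers functions
# under a name and a signature tuple.

OPERATIONS = {}   # name -> {signature tuple: function}

def operation(*signature, name=None):
    def decorator(func):
        op_name = name or func.__name__
        OPERATIONS.setdefault(op_name, {})[signature] = func
        return func
    return decorator

@operation("sql")
def sample(context, obj, limit):
    return f"SQL sample of {limit}"

@operation("rows", name="sample")
def sample_iter(context, obj, limit):
    return f"iterator sample of {limit}"

# both variants now live under the same operation name
print(sorted(OPERATIONS["sample"]))   # [('rows',), ('sql',)]
```

The `name=` keyword lets two differently named Python functions serve as signature variants of one logical operation, as in the `new_rows` example above.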
Context
Context

a collection of operations

[diagram: a context holding operations registered for the SQL, iterator
and Mongo representations]
Operation Call

context = Context()
context.operation("sample")(source, 10)

callable reference with runtime dispatch – one of the registered
"sample" variants (SQL, iterator, …) is chosen at call time
Simplified Call
context.operation(“sample”)(source, 10)
context.o.sample(source, 10)
Dispatch

operation is chosen based on signature
Example: we do not have this kind of operation for MongoDB,
so we use the default iterator variant instead
Dispatch

dynamic dispatch of operations based on representations of argument objects

Priority: the order of representations matters, and might be decided
during runtime – the same representations in a different order can
select a different variant
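The dispatch rule described above can be sketched as a small resolver: walk the object's representations in priority order, pick the first variant that is registered, and fall back to the `"*"` default. The registry shape is an assumption for illustration.

```python
# Sketch of dynamic dispatch by representation priority.

def dispatch(registry, representations):
    """registry maps a representation name to a function; '*' is default."""
    for rep in representations:          # order expresses priority
        if rep in registry:
            return registry[rep]
    if "*" in registry:
        return registry["*"]
    raise LookupError("no matching operation")

registry = {
    "sql": lambda obj: "native SQL variant",
    "rows": lambda obj: "generic iterator variant",
}

# SQL-backed object: its natural representation comes first
print(dispatch(registry, ["sql", "rows"])(None))    # native SQL variant
# MongoDB object: no Mongo variant registered, iterator fallback wins
print(dispatch(registry, ["mongo", "rows"])(None))  # generic iterator variant
```

Reordering the list to `["rows", "sql"]` would select the iterator variant instead, which is why representation order matters.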
Incapable?

join details on two SQL objects:
■ same connection – use the composed SQL join
■ different connections – composing the statements fails
Retry!

if objects are not compose-able as expected, the operation might gently
fail and request a retry with another signature:

raise RetryOperation("rows", "rows")

e.g. a SQL join of objects A and B from different connections is retried
using their iterator representations
Retry when...
■ not able to compose objects
  because of different connections or other reasons
■ not able to use representation as expected
■ any other reason
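The retry mechanism can be sketched end to end: the SQL variant raises `RetryOperation` with an alternative signature, and the caller re-dispatches. The exception name follows the slides; the surrounding control flow and variant table are illustrative assumptions.

```python
# Sketch of gentle failure and re-dispatch via RetryOperation.

class RetryOperation(Exception):
    def __init__(self, *signature):
        self.signature = signature

def join_sql(master, detail):
    if master["connection"] != detail["connection"]:
        # cannot compose statements across connections - gently fail
        raise RetryOperation("rows", "rows")
    return "composed SQL join"

def join_rows(master, detail):
    return "iterator join"

VARIANTS = {("sql", "sql"): join_sql, ("rows", "rows"): join_rows}

def call(signature, master, detail):
    try:
        return VARIANTS[signature](master, detail)
    except RetryOperation as retry:
        # the operation asked for another signature; dispatch again
        return VARIANTS[retry.signature](master, detail)

a = {"connection": "db1"}
b = {"connection": "db2"}
print(call(("sql", "sql"), a, b))   # iterator join
```

With both objects on `db1` the same call returns the composed SQL join, so the fallback only costs anything when composition is actually impossible.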
Modules

a collection of operations*
SQL, Iterator, MongoDB

* just an example
Extend Context

context.add_operations_from(obj)

any object that has operations as attributes, such as a module
Stores
Object Store
■ contains objects
  tables, files, collections, ...
■ objects are named
  get_object(name)
■ might create objects
  create(name, replace, ...)
Object Store

store = open_store("sql", "postgres://localhost/data")

open_store is a store factory
Factories: sql, csv (directory), memory, ...
Stores and Objects

copy data from a SQL table to CSV:

source = open_store("sql", "postgres://localhost/data")
target = open_store("csv", "./data/")

source_obj = source.get_object("products")
target_obj = target.create("products", fields=source_obj.fields)

for row in source_obj.rows():
    target_obj.append(row)
target_obj.flush()
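The same copy loop can be made runnable with only the standard library, with sqlite3 standing in for the SQL store and the csv module for the CSV store. This shows the data flow of the slide, not the Bubbles `open_store` API.

```python
# SQL -> CSV copy, stdlib-only stand-in for the store example above.
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER, product TEXT)")
conn.executemany("INSERT INTO products VALUES (?, ?)",
                 [(100, "Atari 1040ST"), (101, "Amiga 500")])

buffer = io.StringIO()                       # stands in for a CSV file
writer = csv.writer(buffer)
writer.writerow(["id", "product"])           # fields from the source
for row in conn.execute("SELECT id, product FROM products"):
    writer.writerow(row)                     # target_obj.append(row)

print(buffer.getvalue().splitlines()[1])     # 100,Atari 1040ST
```

The point of the store abstraction is that this boilerplate disappears: the object's fields drive the CSV header and the row loop is generic.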
Pipeline
Pipeline

SQL ⇢ SQL ⇢ SQL ⇢ SQL ⇢ Iterator

a sequence of operations on the "trunk"
Pipeline Operations

extract product colors to CSV:

stores = {
    "source": open_store("sql", "postgres://localhost/data"),
    "target": open_store("csv", "./data/")
}

p = Pipeline(stores=stores)
p.source("source", "products")
p.distinct("color")
p.create("target", "product_colors")

operations – the first argument is the result from the previous step
Pipeline

p.source(store, object_name, ...) ⇢ store.get_object(...)
p.create(store, object_name, ...) ⇢ store.create(...), store.append_from(...)
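The "trunk" idea above, where each operation receives the previous step's result as its first argument, can be sketched with a tiny class. The method names follow the slides, but the class and its list-of-dicts data model are illustrative assumptions.

```python
# Minimal sketch of a pipeline: each step consumes the previous result.

class Pipeline:
    def __init__(self):
        self.result = None          # the "trunk"

    def source(self, data):
        self.result = data
        return self

    def distinct(self, key):
        seen, out = set(), []
        for row in self.result:     # previous step's result
            if row[key] not in seen:
                seen.add(row[key])
                out.append(row)
        self.result = out
        return self

p = Pipeline()
p.source([{"color": "red"}, {"color": "red"}, {"color": "blue"}])
p.distinct("color")
print([r["color"] for r in p.result])   # ['red', 'blue']
```

Returning `self` from each method also allows the chained style `p.source(...).distinct("color")`.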
Operation Library
Filtering
■ row filters
  filter_by_value, filter_by_set, filter_by_range
■ field_filter(ctx, obj, keep=[], drop=[], rename={})
  keep, drop, rename fields
■ sample(ctx, obj, value, mode)
  first N, every Nth, random, …
Uniqueness
■ distinct(ctx, obj, key)
  distinct values for key
■ distinct_rows(ctx, obj, key)
  distinct whole rows (first occurrence of a row) for key
■ count_duplicates(ctx, obj, key)
  count number of duplicates for key
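For the iterator representation, two of the uniqueness operations listed above could look like this. Signatures follow the slides (the `ctx` argument is unused here); the bodies are illustrative sketches, not the library's code.

```python
# Iterator-representation sketches of uniqueness operations.
from collections import Counter

def distinct(ctx, rows, key):
    """Distinct values for key, in order of first appearance."""
    seen = []
    for row in rows:
        if row[key] not in seen:
            seen.append(row[key])
    return seen

def count_duplicates(ctx, rows, key):
    """Number of extra occurrences per duplicated key value."""
    counts = Counter(row[key] for row in rows)
    return {value: n - 1 for value, n in counts.items() if n > 1}

rows = [{"color": "red"}, {"color": "blue"}, {"color": "red"}]
print(distinct(None, rows, "color"))          # ['red', 'blue']
print(count_duplicates(None, rows, "color"))  # {'red': 1}
```

A SQL-representation variant of `distinct` would instead compose a `SELECT DISTINCT` statement, which is exactly what the signature dispatch is for.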
Master-detail
■ join_detail(ctx, master, detail, master_key, detail_key)
Joins a detail table, such as a dimension, on a specified key. The detail key field will be dropped from the result.
Note: other join-based operations will be implemented
later, as they need some usability decisions to be made
Dimension Loading
■ added_keys(ctx, dim, source, dim_key, source_key)
  which keys in the source are new?
■ added_rows(ctx, dim, source, dim_key, source_key)
  which rows in the source are new?
■ changed_rows(ctx, target, source, dim_key, source_key, fields, version_field)
  which rows in the source have changed?
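The `added_keys` idea above amounts to a set difference between source keys and dimension keys. The signature follows the slides; the body is an illustrative iterator-representation sketch.

```python
# Sketch of added_keys: which keys in the source are new to the dimension?

def added_keys(ctx, dim, source, dim_key, source_key):
    existing = {row[dim_key] for row in dim}
    return [row[source_key] for row in source
            if row[source_key] not in existing]

dim = [{"product_id": 100}, {"product_id": 101}]
source = [{"id": 100}, {"id": 102}]
print(added_keys(None, dim, source, "product_id", "id"))   # [102]
```

A SQL variant could compose the same question as an anti-join (`LEFT JOIN … WHERE dim_key IS NULL`) without fetching any rows.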
more to come…
Conclusion
To Do
■ consolidate representations API
■ define basic set of operations
■ temporaries and garbage collection
■ sequence objects for surrogate keys
Version 0.2
■ processing graph
  connected nodes, like in Brewery
■ more basic backends
  at least Mongo
■ bubbles command line tool
already in progress
Future
■ separate operation dispatcher
  will allow custom dispatch policies
Contact: @Stiivi
databrewery.org