arXiv:2004.13495v1 [cs.DB] 28 Apr 2020


Towards a Polyglot Data Access Layer for a Low-Code Application Development Platform∗

Ana Nunes Alonso1, João Abreu2, David Nunes2, André Vieira2,

Luiz Santos2, Tércio Soares2, and José Pereira1

1INESC TEC and U. Minho, [email protected], [email protected]
2OutSystems, first.last@outsystems.com

Abstract

Low-code application development as proposed by the OutSystems Platform enables fast mobile and desktop application development and deployment. It hinges on visual development of the interface and business logic but also on easy integration with data stores and services while delivering robust applications that scale.

Data integration increasingly means accessing a variety of NoSQL stores. Unfortunately, the diversity of data and processing models that makes them useful in the first place is difficult to reconcile with the simplification of abstractions exposed to developers in a low-code platform. Moreover, NoSQL data stores also rely on a variety of general-purpose and custom scripting languages as their main interfaces.

In this paper we propose a polyglot data access layer for the OutSystems Platform that uses SQL with optional embedded script snippets to bridge the gap between low-code and full access to NoSQL stores. In detail, we characterize the challenges for integrating a variety of NoSQL data stores; we describe the architecture and proof-of-concept implementation; and evaluate it with a sample application.

∗This work was supported by Lisboa2020, Compete2020 and FEDER through Project RADicalize (LISBOA-01-0247-FEDER-017116 | POCI-01-0247-FEDER-017116).

1 Introduction

According to Forrester Research, which defines low-code as “enabl[ing] rapid delivery of business applications with a minimum of hand-coding and minimal upfront investment in setup, training, and deployment,” OutSystems is a leading low-code platform for application development and delivery [19]. It emphasizes drag-and-drop to define the functionality for UI, business processes, logic, and data models to create full-stack, cross-platform applications.

Integrating with existing systems increasingly means connecting to a variety of NoSQL data stores, deployed as businesses take advantage of Big Data. Our goal is to enable interactive applications built with the OutSystems Platform [16] to query data in NoSQL stores.

The current standard for integrating NoSQL stores with available low-code platforms is for developers to manually define how the available data must be imported and consumed by the platform, requiring expertise in each particular NoSQL store, especially if performance is a concern.

Conversely, an ideal OutSystems experience for leveraging Big Data should be: (1) create an integration with the Big Data repository, providing the connection details including credentials; (2) the platform introspects the data in the repository and creates a representation for it; (3) the developer trims


down the mapping to the information that is valuable for the application; (4) the developer is able to query this information with some transformation capabilities, which include filtering, sorting, and grouping; (5) the platform handles all the requirements for providing this information with enterprise-grade non-functional requirements (NFRs) such as security, performance, and scalability; and (6) the developer is able to create applications that can interact with the Big Data repository and leverage this information in business logic and processes, and also to create visualizations.

Delivering the ideal experience for Big Data does, however, raise significant challenges. First, step (2) is challenged by NoSQL systems not having a standardized data model, a standard method to query metadata, or even, in many cases, by not enforcing a schema at all. In fact, even if stored data conform to a well-defined schema, it may be only implicit in the code of the applications that manipulate it.

Second, the value added by NoSQL data stores rests precisely on a diversity of query operations and query composition mechanisms that exploit specific data models, storage, and indexing structures. Exposing these as visual abstractions for manipulation in step (4) risks polluting the low-code platform with multiple particular and overlapping concepts, instead of general-purpose abstractions. On the other hand, if we expose the minimal common factor between all NoSQL data stores, we are likely to end up with minimal filtering capabilities that prevent developers from fully exploiting NoSQL integration. In either case, some NoSQL data stores offer only very minimal query processing capabilities and thus force client applications to code all other data manipulation operations, which also conflicts with the low-code approach. Finally, step (5), ensuring that performance is compatible with interactive applications, means that one cannot resort to built-in MapReduce to cope with missing query functionality, as it leads to high latency and resource usage. Also, coping with large-scale data sets means full data traversals should be avoided. This can be done by exposing relevant indexing mechanisms and resorting to approximate and incomplete data, for instance, when displaying a developer preview during step (3).

Our goal is to add the ability to query data in NoSQL stores to interactive applications created using the OutSystems platform, preserving the low-code developer experience, i.e., without requiring specific NoSQL knowledge for a developer to successfully write applications that leverage this type of data store.

Enabling the seamless integration of a multitude of NoSQL stores with the OutSystems platform will offer its more than 200 000 developers a considerable competitive advantage over other currently available low-code offers. Moreover, Gartner predicts low-code application platforms will be used for 65% of all application development activity in 5 years' time [25], amplifying the impact of our contribution.

In this paper we summarize our work on a proof-of-concept polyglot data access layer for the OutSystems Platform that addresses these challenges, thus making the following contributions:

• We describe in detail the challenge in integrating NoSQL data stores in a low-code development platform targeted at relational data. This is achieved mainly by surveying how data and query models in a spectrum of NoSQL data stores match the abstractions that underlie the low-code approach in OutSystems.

• We propose to use a polyglot query engine, based on extended relational data and query models, with embedded NoSQL query script fragments as the approach that reconciles the expectation of low-code integration with the reality of NoSQL diversity.

• We describe a proof-of-concept implementation that leverages an off-the-shelf SQL query engine that implements the SQL/MED standard [10] for managing external data.

As a result, we describe various lessons learned that are relevant to the integration of NoSQL data stores with low-code tools in general, to how NoSQL data stores can evolve to make this integration easier and more effective, and to research and development in polyglot query processing systems in general.

The rest of the paper is structured as follows. In Section 2 we briefly describe the OutSystems platform, how it handles data access, and how it integrates with


external SQL databases. Section 3 presents the main results of an analysis of target data stores, focusing on how their characteristics impact our goal. Section 4 discusses how schemas can be exposed or inferred for these data stores. Section 5 describes our proposal to integrate NoSQL data stores in the OutSystems platform, including our current proof-of-concept implementation. This proposal is then evaluated in Section 6 and compared to related work in Section 7. Section 8 concludes the paper by discussing the main lessons learned.

2 Background

The OutSystems Platform is used to develop, deploy and operate a large number of custom applications through a model-driven approach based on a low-code visual language [16]. A key trait of the approach is that the most common tasks executed in the creation of an information system are visually modeled. This is a major contribution to providing acceleration to customers. Visual models abstract much of the complexity of low-level coding, are easier to read, and therefore reduce the time needed for knowledge transfer. Having a common modeling language favors skill reuse when implementing back-end logic, business processes, web UIs, or mobile UIs. A second key trait is that customers should never hit a wall, in the sense that the less common tasks can also be achieved, sometimes not as easily, but still using visual modelling and/or extensibility to 3GL languages.

2.1 Architecture

The OutSystems solution architecture [13] features five main components, as shown in Figure 1:

Service Studio. The development environment for all the DSLs supported by OutSystems. When the developer publishes an application, Service Studio saves a document with the application’s model and sends it to the Platform Server.

Platform Server. Takes care of all the steps required to generate, build, package, and deploy native C# web applications on top of a particular stack (e.g., for a Windows Server using SQL Server this will be ASP.Net and SQL code). The compiled application is then deployed to the Application Server.

Application Server. Runs on top of IIS. The server stores and runs the developed applications.

Service Center. Provides a web interface for sysadmins, managers, and operations teams. Connections to external databases are configured here.

Integration Studio. A desktop environment, targeted at technical developers, used to integrate external libraries, services, and databases.

Figure 1: OutSystems Platform Architecture

2.2 Data Access

One of the most common operations in data-driven applications is to fetch and display data from the database. The OutSystems Platform models data access through the concept of Entities, elements that enable information to be persisted in the database and the implementation of a database model, following the relational model to represent data and its relationships. A visual editor that allows development teams to query and aggregate data visually is also provided, so that developers with any skill set can work with the complex data needed for any application.

Using this platform, developers can create integrations of local and external data sources without


having to write custom code, significantly reducing time and effort, and eliminating errors. OutSystems integrates natively with several of the major relational database systems: SQL Server, SQL Azure, Oracle, MySQL, and DB2 iSeries. This allows the development of applications that access data in external databases using the OutSystems entity model without having to worry about data migration. Integration with external databases involves: (1) in the Service Center, defining a connection to the external database; (2) in the Integration Studio, mapping tables or views in the external database to entities in an extension module; (3) in the Service Studio, referencing and using the extension in an application.

3 Data stores

NoSQL data stores, in general, forgo the requirement for a strict structure, sacrificing query processing capabilities for flexibility and efficiency. While SQL became an ANSI standard, allowing queries to be executed in compatible RDBMSs with few idiomatic changes, NoSQL data stores offer a variety of data models, also supporting different data types and query capabilities. Additionally, most support heterogeneity at each hierarchical level, allowing, for example, rows of a given table to have different attributes. Values can be any type of object: for document stores, values can be semi-structured JSON documents, for example; for wide-column data stores, values can be variable sets of columns. This flexibility harms predictability, as additional knowledge (metadata) of what is being stored is required, as well as the ability to handle similar data items with irregular structures.

NoSQL data stores initially provided only minimal query capabilities. In fact, pure key-value stores offer only minimal get, put, and delete operations. However, query capabilities in NoSQL data stores have been enriched over time: by embedding different subsets and extensions of SQL, as is the case with Cassandra, Couchbase, Hive, and Cloudera; by evolving increasingly complex native query languages out of initially simplistic query mechanisms, as is the case with MongoDB and Elasticsearch; and by actually offering complex declarative languages that do not extend SQL but are tailored to their data models, as is the case with Neo4j. In all of these, the increasing complexity and expressiveness of the language means that these systems have actually evolved to have query engines that, to some extent, perform query optimization and execution.

The selection of data stores to be analysed in the context of this work was based on two criteria. First, variety, to include the most common NoSQL models, such as document-based, wide-column, pure key-value, and graph-based. Second, utility, targeting the most popular data stores in each class, considering the client base of the OutSystems platform.

The minimum requirements for NoSQL data stores to be compatible with the OutSystems platform are: (1) the ability to expose the database schema, to enable the platform user to select the appropriate databases, tables, attributes, or equivalent constructs, explored in Section 4; (2) a compatible model (and syntax) for expressing queries, considering filtering operators such as selection and projection, ordering and grouping operators such as sort by and group by, and aggregation operators for counting, summation and average; and (3) a compatible structure for consuming query results, currently restricted to scalar values or flat structures, i.e., no nested data.

Additionally, it would be desirable to: enable visual construction of queries based on online partial results; have the ability to take advantage of NoSQL-specific features; and provide client-side query simplification.

While the platform also supports a hierarchical model, currently geared towards consuming data from REST endpoints, data manipulation using the relational model is better supported, as the current query construction interface uses a tabular representation for data. Also, it is a more natural fit for the guiding principles of a low-code platform, as it is more likely to be familiar to developers. Focusing on this route for integration, for each examined data store we evaluate: (1) the available query operations, including traversal of data, filtering, and computing different aggregations (Section 3.1); (2) how such operations can be combined (Section 3.2); and (3) how to map such operators and compositions to the OutSystems model (Section 3.3).


3.1 Query Operators

The fitness of the relational model for interfacing with each NoSQL data store rests on: (1) the ability to express a given operation in the data store's model and query language, even if some translation is required; and (2) results being either scalar values or convertible to a tabular form.

In order to use a relational notation with both document-based and row-based models supporting nested structures (including collections), these must be flattened. One way to do this is to promote the fields of the nested structure to outermost attributes and unnest lists, creating a separate row for each element of the list.

As an example, consider the MongoDB document shown in Figure 2. This document can be modelled as a relation, where the name of each attribute is prefixed with a path to reach it in the original model. The result of converting the document to this model is depicted in Figure 3. Data from other documents in the same collection could be added as rows. If the schema is partial (or probabilistic), some rows will likely be missing some attributes. Naturally, this step requires the schema to be retrievable, whether explicit or inferred, a concern addressed in Section 4.
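The flattening just described can be sketched in a few lines of Python. This is an illustrative helper, not platform code; it assumes documents are plain dicts and that lists contain objects, and it mirrors the conversion of Figures 2 and 3:

```python
# Illustrative sketch of the flattening described above: nested fields
# become path-prefixed attributes and lists of objects are unnested
# into one row per element.

def flatten(doc, prefix=""):
    """Return a list of flat dicts (rows) for one possibly nested document."""
    rows = [{}]
    for key, value in doc.items():
        path = prefix + key
        if isinstance(value, dict):        # nested object: recurse and merge
            sub = flatten(value, path + ".")
        elif isinstance(value, list):      # list of objects: one row each
            sub = [row for item in value for row in flatten(item, path + ".")]
        else:                              # scalar attribute
            sub = [{path: value}]
        rows = [{**r, **s} for r in rows for s in sub]
    return rows

# The document of Figure 2 (the _id field is omitted here).
doc = {
    "id": "store::1",
    "location": "Braga",
    "sells": [
        {"widget": {"id": "Widget1", "color": "red"}, "qty": 5},
        {"widget": {"id": "Widget2", "color": "blue"}, "qty": 2},
    ],
}
rows = flatten(doc)  # two rows, as in Figure 3
```

Each document contributes as many rows as the product of its list lengths, which is exactly the unnesting behaviour assumed by the relational representation.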

This method can be extended to graph databases, for acyclic schemas. Using Neo4j as an example, considering the information that can be retrieved regarding node and edge properties, a relational schema could be exposed with a relation per node type, with properties as attributes, and a relation per edge type (relationship type in Neo4j). However, taking full advantage of the capabilities of the graph model may not be as straightforward, for example, using the length of a path in a query.

Consider, instead, a key-value data model with collections, such as Redis. While extracting a high-level schema for the stored data from Redis is not generally feasible, it can be done on a per-key basis, first by listing the keys and then querying the type of the associated value. Each value type can be straightforwardly converted to a tabular format, except for HyperLogLogs, a probabilistic data structure used to count unique items.

Selection (filtering rows/records) and projection (filtering columns/attributes) operators are generally supported in NoSQL data stores, even if projections require defining paths on possibly nested structures.
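The per-key schema extraction for Redis can be sketched as follows. This is a hypothetical helper: a real client would list keys with SCAN and query each value's type with the TYPE command, while here the values are plain Python stand-ins and the sample keys are invented:

```python
# Hypothetical sketch of per-key schema extraction for Redis-style data.

def infer_key_types(store):
    """Map each key to a Redis-like type name, as TYPE would report."""
    names = {str: "string", list: "list", set: "set", dict: "hash"}
    return {key: names[type(value)] for key, value in store.items()}

def to_rows(key, value):
    """Convert one key's value into flat (key, field, value) rows."""
    if isinstance(value, str):
        return [(key, None, value)]
    if isinstance(value, (list, set)):
        return [(key, None, v) for v in sorted(value)]
    if isinstance(value, dict):            # hash: one row per field
        return [(key, field, v) for field, v in sorted(value.items())]
    raise TypeError("unsupported value type")

store = {                                  # made-up sample keys
    "user:1": {"name": "Ana", "city": "Braga"},
    "visits": "42",
    "tags": {"kv", "nosql"},
}
```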

The commands/keywords used to access a given feature in each particular data store are presented in Figure 4, as a command where available, or as a short explanation of what is available, where convenient.

MongoDB is a JSON-centric database system, using JSON both as its data model and as its query language. It has thus been highly successful for Web applications, where it is used together with JavaScript. Objects are grouped in collections, which are stored in databases. Each object is identified in the context of a collection by the _id field, for direct access. This field can be provided by the application with any relevant content, or is otherwise automatically generated with a unique ObjectId. MongoDB can store relational data by introducing reference fields that hold the value in the _id of the referenced document. Nonetheless, MongoDB encourages denormalization and storing related entities as a hierarchical structure in a single document. The native API for MongoDB is a JavaScript library, where queries are expressed as JSON documents. Similar APIs exist, however, for a variety of programming languages and platforms. First, this interface allows the application to insert, retrieve, and delete documents directly referenced by their _id. Second, it allows searching for documents by providing a document for matching. At its simplest, this includes key-value pairs that need to exist in the target documents. It also allows several relational and boolean operators to be specified with keywords prefixed by the $ symbol, allowing complex expressions to be composed. In update operations, the new value can also be computed by using a set of functions, also accessed with $ keywords. The db.collection.find() method retrieves all documents that match a set of criteria defined over the fields of the document. While, by default, all fields of matching documents are retrieved, the set of fields to be projected can be defined in a projection document. Projection of elements in nested arrays can also be controlled.

{
    _id: (hidden),
    id: "store::1",
    location: "Braga",
    sells: [
        { widget: { id: "Widget1", color: "red" }, qty: 5 },
        { widget: { id: "Widget2", color: "blue" }, qty: 2 },
    ]
}

Figure 2: Example MongoDB document for a widget store.

id        location  sells.widget.id  sells.widget.color  sells.qty
store::1  Braga     Widget1          red                 5
store::1  Braga     Widget2          blue                2
store::2  Braga     Widget1          blue                2

Figure 3: Representation of a MongoDB document in a relational model.

Couchbase is, like MongoDB, a document store that uses JSON as its data model. It provides a simple key-value interface and N1QL, a rich SQL-like query language. The outermost data organization units are buckets, which group key-value pairs (items). Keys are immutable unique identifiers, mapped to either binary or JSON values. Like MongoDB, relations between documents can be created either by referencing other documents by key or through embedding, creating a hierarchical structure. Selection and projection are similar to SQL's, but dealing with nested structures such as arrays requires explicitly unnesting these, using the UNNEST clause.
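To make the UNNEST clause concrete, the following sketch composes an N1QL query that unnests the "sells" array of the widget-store documents of Figure 2. The bucket name "stores" and the helper itself are hypothetical; only the SELECT/UNNEST shape is N1QL:

```python
# Sketch: composing an N1QL query with an UNNEST clause over the
# "sells" array of Figure 2's documents. Bucket name is invented.

def unnest_query(bucket, array_attr, fields):
    alias = array_attr[0]                  # short alias, e.g. "s" for "sells"
    return (f"SELECT {', '.join(fields)} "
            f"FROM `{bucket}` UNNEST {array_attr} AS {alias}")

q = unnest_query("stores", "sells", ["location", "s.widget.id", "s.qty"])
```

The resulting statement produces one row per array element, matching the tabular representation of Figure 3.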

Redis is an in-memory key-value store using strings as keys, where values can be strings or data structures such as lists, sorted or unsorted sets, hash maps, bit arrays, and HyperLogLogs, probabilistic data structures used to count unique items, trading precision for space savings. Querying over values is not supported, only queries by key. The alternative is costly: a full scan, returning key-value pairs that match a specified condition. In terms of projection, either a single attribute is retrieved or all are.

Redis is used primarily for caching, with optional persistence. Another usage pattern is to use Redis for write-intensive data items, using a backing data store for other data.

Apache Cassandra is a wide-column store inspired by BigTable. Cassandra, however, supports super column families as an additional aggregate. It supports CQL, a SQL-like query language. Queries that use relation traversal, i.e., the equivalent to JOINs in the relational model, require either custom indices or secondary indices on would-be JOIN attributes to be available. Index creation is asynchronous.

Apache HBase is also a wide-column store, modelled on Google BigTable and implemented on the Hadoop stack. HBase does not constrain data stored in keys, qualifiers (attributes), and values. Data items therefore have to be serialized to byte arrays before storage and deserialized back upon retrieval. HBase natively offers a simple Java client API and no query language. Selection is implemented either by retrieving a single row or multiple rows, given the keys and an optional upper bound on the timestamp, or by scanning a range of rows, given (optional) start and stop keys, optionally providing a filter expression comparing a value to a provided constant. Projection is supported through the definition of a subset of qualifiers to be retrieved.
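Because HBase stores only uninterpreted byte arrays, the client must serialize values itself. A minimal illustration of such an encoding (on the JVM this is typically done with HBase's Bytes utility class; the big-endian layout here is one possible choice):

```python
# Illustrative client-side serialization for an HBase-style store:
# values must be turned into byte arrays before storage and decoded
# back on retrieval.

import struct

def encode_qty(n):
    """Serialize an int to 4 big-endian bytes, ready to store in a cell."""
    return struct.pack(">i", n)

def decode_qty(data):
    """Deserialize the 4-byte cell value back to an int."""
    return struct.unpack(">i", data)[0]

cell = encode_qty(5)
```

Note that any filter comparing cell values against a constant must use the same encoding, since HBase compares raw bytes.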

Amazon DynamoDB is a large-scale distributed data store offered as part of Amazon Web Services. Data is stored in tables, where rows are identified by simple or composite primary keys. If using a simple primary key, it serves as the partition key. Composite primary keys are limited to two attributes, where the


first attribute is also used as a partition key and the second is used for sorting the rows. Selection and projection capabilities are similar to those described for HBase. Answering queries that use relation traversal efficiently requires creating secondary indexes. However, the number of indices that can be created is limited. The costly alternative is to use table scans and combine data client-side.
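A query over a composite primary key can be sketched as the request below: equality on the partition key plus a range condition on the sort key. The table and attribute names are invented; a real application would pass a dict of this shape to a DynamoDB client such as boto3's table.query():

```python
# Sketch of a DynamoDB Query request over a composite primary key.
# Attribute names (store_id, widget_id, qty) are hypothetical.

def build_query(partition_value, sort_prefix):
    return {
        "KeyConditionExpression":
            "store_id = :p AND begins_with(widget_id, :w)",
        "ExpressionAttributeValues": {":p": partition_value, ":w": sort_prefix},
        "ProjectionExpression": "store_id, widget_id, qty",   # projection
    }

req = build_query("store::1", "Widget")
```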

Elasticsearch is a document store focusing on text indexing and retrieval, primarily developed as a distributed version of the well-known Lucene text indexing library. It uses the well-known JSON format as its base data model and groups JSON objects in collections called “indexes”. Each object in an index has a unique identifier used for direct access. It is expected that each of these indexes contains a fairly homogeneous set of objects, with mostly the same attributes, being used for the same purpose. Elasticsearch allows a very limited form of relational data by introducing document parent-child relationships. This allows splitting very large documents that contain many sub-elements and would be inefficiently managed, while maintaining the ability to traverse them together in some operations. This possibility has, however, many restrictions, and the documentation emphatically discourages its generalized use. Selection and projection are supported either in the native API or in an experimental SQL-like API.
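In the native API, selection and projection correspond to a match clause and _source filtering in the search body. A sketch, with index layout and field names mirroring the widget-store example (and therefore hypothetical):

```python
# Sketch: an Elasticsearch search body combining a match clause
# (selection) with _source filtering (projection).

def search_body(field, text, wanted):
    return {
        "query": {"match": {field: text}},   # full-text selection
        "_source": wanted,                   # projection of stored fields
    }

body = search_body("location", "Braga", ["location", "sells.qty"])
```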

Neo4j is a graph data management system based on the Java platform. The graph data model and operations make it a very different alternative to most other well-known NoSQL systems. Information is represented as a directed graph, defined as a set of named nodes and a set of edges, each connecting a source node to a target node. Additional information can then be attached to nodes and edges in one of two ways: as a single tag, that is, a symbolic identifier with particular significance as metadata and for indexing; or as a set of named properties, which can have various types and hold arbitrary data. Property values, which can be manipulated in queries, can be both primitive types such as numbers or strings, but also composite values such as lists and maps. This means that the data attached to nodes and edges can be nested in structures typical of JSON documents. The main interface to Neo4j is Cypher, a declarative graph query and manipulation language. It natively supports graph concepts such as paths and matching on graph structure. Along with SQL-like selection and projection capabilities on the data attached to nodes and edges, attributes such as the length of a path can also be specified.
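A sketch of composing such a Cypher query, matching on graph structure and using the length of a variable-length path. The Store label and SUPPLIES relationship type are invented for illustration:

```python
# Sketch: building a Cypher query that matches graph structure and
# projects the length of a variable-length path. Labels/properties
# are hypothetical.

def reachable_stores(max_hops):
    return (f"MATCH p = (s:Store)-[:SUPPLIES*1..{max_hops}]->(t:Store) "
            f"WHERE s.location = 'Braga' "
            f"RETURN t.id, length(p) ORDER BY length(p)")

q = reachable_stores(3)
```

Since length(p) appears in the RETURN clause, the result is already tabular, which is the output format listed for Neo4j in Figure 4.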

Apache Hive and Cloudera Impala are analytic query engines for a SQL-like query language, using, preferably, data stored in HDFS files. Hive is built on top of Hadoop and translates queries to MapReduce jobs. Impala translates queries to a traditional Volcano-style parallel execution plan. They share the same client API, most of the language, and some of the infrastructure, namely for storage and metadata. Data can be organized in rich hierarchical structures, which are, however, strongly typed. Nevertheless, processing is done in terms of flat rows.

Figure 4 also presents the output format for query results. While some of the analysed data stores present results in document or row-based data models with nested structures, these can be easily converted to a tabular format using the already discussed techniques, fulfilling requirement (2).

Regarding sorting, grouping, and aggregation operators, Figure 5 shows the commands/keywords used to access the given feature in each particular data store, presented as a command where available, or as a short explanation of known limitations where convenient. Redis only sorts elements in a data structure (list, set, sorted set) stored for a specific key. Again, support for SQL-like commands in Elasticsearch is experimental. Cassandra supports sorting only on one of a table's attributes, the first clustering column, while grouping can only be done using the partition key. DynamoDB has similar restrictions on ordering. Aggregates require creating indexes/materialized views, precluding ad-hoc queries. Results in HBase are sorted in a fixed order, using the row key for the outermost comparison. Aggregates can be implemented in HBase using coprocessors. The remaining data stores implement sorting, grouping, and aggregation operators that are compatible with SQL's, as indicated in the column headers.
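As an example of the grouping/aggregation column of Figure 5, the sketch below builds a MongoDB aggregation pipeline computing the total quantity per widget color over documents like the one in Figure 2. The stage keywords ($unwind, $group, $sum, $sort) are MongoDB's; the surrounding helper is illustrative only:

```python
# Sketch: a MongoDB aggregation pipeline — unnest the sells array,
# group by widget color, sum quantities, sort by the total.

def qty_by_color():
    return [
        {"$unwind": "$sells"},                       # unnest the array
        {"$group": {"_id": "$sells.widget.color",    # group by color
                    "total": {"$sum": "$sells.qty"}}},
        {"$sort": {"total": -1}},                    # order by total desc
    ]

pipeline = qty_by_color()
```

A client would pass this list to the collection's aggregate method; each stage consumes the previous stage's output, much like operators in a relational query plan.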


               Selection                              Projection                            Output
MongoDB        db.collection.find()                   projection document ($, $elemMatch)   JSON
Couchbase      WHERE clause                           SELECT                                JSON
Redis          by key, or SCAN with MATCH             single attribute or all               scalar or flattened array
Cassandra      WHERE clause (partial)                 SELECT                                tabular
HBase          by key, or SCAN with filter            addFamily(), addColumn()              flattened map
DynamoDB       by key, or SCAN with FilterExpression  ProjectionExpression                  JSON
Elasticsearch  match clause / WHERE                   _source / SELECT                      JSON / tabular
Neo4j          MATCH … WHERE                          RETURN                                tabular or JSON
Hive/Cloudera  WHERE clause                           SELECT                                tabular

Figure 4: Selection and Projection operators in NoSQL data stores.

               Sorting (ORDER BY)                     Grouping (GROUP BY)    Aggr. (COUNT, SUM, …)
MongoDB        $orderby                               $group                 yes
Couchbase      ORDER BY                               GROUP BY               yes
Redis          SORT <key> (partial)                   no                     no
Cassandra      ORDER BY (partial)                     GROUP BY (partial)     yes
HBase          not ad-hoc                             not ad-hoc             not ad-hoc
DynamoDB       partial/fixed                          not ad-hoc             not ad-hoc
Elasticsearch  sort, terms.order, sort.mode / ORDER BY  terms / GROUP BY     yes
Neo4j          ORDER BY                               RETURN                 yes
Hive/Cloudera  ORDER BY                               GROUP BY               yes

Figure 5: Sorting, grouping, and aggregation operators in NoSQL data stores.

                Sorting (ORDER BY)                         Grouping (GROUP BY)   Aggr. (COUNT, SUM, …)
MongoDB         $orderby                                   $group                yes
Couchbase       ORDER BY                                   GROUP BY              yes
Redis           SORT <key> (partial)                       no                    no
Cassandra       ORDER BY (partial)                         GROUP BY (partial)    yes
HBase           not ad-hoc                                 not ad-hoc            not ad-hoc
DynamoDB        partial/fixed                              not ad-hoc            not ad-hoc
Elasticsearch   sort, terms.order, sort.mode / ORDER BY    terms / GROUP BY      yes
Neo4j           ORDER BY                                   RETURN                yes
Hive/Cloudera   ORDER BY                                   GROUP BY              yes

Figure 5: Sorting, Grouping and Aggregation operators in NoSQL data stores.

3.2 Query Construction

Elementary operators such as projection and selection can be combined into complete query expressions. On one extreme, data stores with limited query capabilities allow only simple fixed combinations of supported operators. For instance, HBase allows projection and selection to be optionally specified for each query. On the other extreme, data stores allow arbitrary combinations of operators to be submitted and often perform an optimization step, reordering and transforming the proposed expression for efficient execution. This section focuses on the latter, namely on the MapReduce, pipeline and tree query construction paradigms.

MapReduce is a particular query construction approach as it always assumes an intermediate grouping operation. Briefly, it works as follows: the first step, Map, allows an arbitrary function to translate each data item in an input collection to zero or more output keys and values, usually to filter and transform data. In a second step, data items are grouped by key and Reduced, allowing an arbitrary function to process the set of values associated with each key, usually to aggregate them. Often, multiple map and reduce stages can be used, thus allowing more complex queries that group by different criteria in different stages. MapReduce jobs are mostly used to asynchronously perform table scans and build secondary indexes, which enable these stores to offer querying capabilities that are expected of the relational model: materialized views, aggregations, etc. MapReduce can be used to execute relational equi-JOINs by using the join key for grouping and generating all possible matches in the reducer. In this perspective, consuming MapReduce results can be done within the relational model, as long as these come in (or are easily translatable to) tabular form, possibly with an inferred schema. There are, however, drawbacks. First, MapReduce focuses on analytical workloads, which read most of or all input data, resulting in poor interactive performance. Moreover, limiting the specification of MapReduce tasks to SQL may be undesirable. An argument can be made that the specification of MapReduce jobs requires sufficient expertise to fall outside the typical use of the platform, making it best made available through a low-level extension.

A more general query construction approach is to allow arbitrary operators to be chained in a linear pipeline. The first operator reads data, often with an implicit selection and projection. Each operator does some transformation, such as grouping and aggregation, and sends data to the next. An example of this approach is provided by MongoDB's aggregation pipeline. Figure 6 demonstrates how this approach can be used over a denormalized schema, in which all data for a store is kept in a single document.

While this is not the most natural way to express this query in MongoDB, it is, however, a simple example of an aggregation pipeline. The query first filters stores located in "Braga" and projects only the required fields. Then it unwinds the "sells" array, producing a document for each item sold. It then projects the required fields and filters widgets with color "red". This approach has several advantages. First, it can express a large subset of all queries possible with SQL. In fact, the linear chain of operators precludes only JOIN operations, where more than one data source would be required, and some sub-queries, i.e., mainly those that are equivalent to JOIN operations. Second, the pipeline transformation of data is intuitive, as the same paradigm is used in various forms in computer interfaces. Finally, data stores supporting this paradigm also tend to include query optimizers that reorder operators to produce an efficient execution plan. This means that a query can be incrementally built without being overly concerned with its efficiency.
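The effect of such a pipeline can be emulated step by step over plain Python lists. The sketch below mirrors the stages of the pipeline shown in Figure 6 ($match, $project, $unwind); the sample documents are illustrative stand-ins, not the paper's dataset:

```python
# Denormalized store documents: all data for a store in a single document.
stores = [
    {"id": 1, "location": "Braga",
     "sells": [{"widget": {"color": "red"}}, {"widget": {"color": "blue"}}]},
    {"id": 2, "location": "Porto",
     "sells": [{"widget": {"color": "red"}}]},
]

def match(docs, pred):           # $match: keep documents satisfying a predicate
    return [d for d in docs if pred(d)]

def project(docs, fn):           # $project: reshape each document
    return [fn(d) for d in docs]

def unwind(docs, field):         # $unwind: one output document per array element
    return [{**d, field: item} for d in docs for item in d[field]]

# The same stages as the aggregation pipeline of Figure 6:
out = match(stores, lambda d: d["location"] == "Braga")
out = project(out, lambda d: {"expr000": d["sells"], "expr001": d["id"]})
out = unwind(out, "expr000")
out = project(out, lambda d: {"sid": d["expr001"],
                              "ITEM": d["expr000"]["widget"]["color"]})
out = match(out, lambda d: d["ITEM"] == "red")
out = project(out, lambda d: {"sid": d["sid"]})
```

The linear chain is apparent: each stage consumes the previous stage's output, which is what makes the paradigm both easy to optimize and easy to build incrementally.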

The most general approach is to allow arbitrary data query expressions that translate to operator trees. This allows expressing relational JOIN operations between data resulting from any two sub-queries. It also allows a general use of sub-queries whose results are used as inputs to other operators. The main advantage of this approach is that it is what has traditionally been done with SQL databases. Ironically, NoSQL data stores therefore increasingly provide SQL-like dialects and query optimizers. This allows existing low-code query builders designed for SQL databases to be reused with minor modifications, mainly to what functions and operators are available.

Figure 7 summarizes the supported query construction paradigms per data store, presenting either the method or concept that provides the capability.

Redis supports only a key-value interface, therefore having no query construction capabilities. Amazon's DynamoDB supports only simple queries by primary key (or a range thereof), with optional filtering applied just before the data is returned.

3.3 Discussion

The main conclusion of the analysis is that the relational model is to a large extent able to map enough of NoSQL data stores to fit the OutSystems low-code platform. This means that the current aggregation model, designed primarily for relational data, can be adequately used to describe most query operators and construction schemas found in NoSQL data stores. This is a consequence of two aspects: first, the completeness of the relational model, which can map most operators and query construction approaches found in NoSQL data stores, requiring only an extension to support nested data; second, the trend for NoSQL data stores to increasingly provide SQL support, in particular by using unwind/unnest operations to deal with denormalized and nested data.

{
  aggregate: "stores",
  pipeline: [
    { $match: { location: { $eq: "Braga" } } },
    { $project: { id: 1, location: 1, sells: 1 } },
    { $project: { expr000: "$sells", expr001: "$id" } },
    { $unwind: "$expr000" },
    { $project: { sid: "$expr001", ITEM: "$expr000.widget.color" } },
    { $match: { ITEM: { $eq: "red" } } },
    { $project: { sid: 1 } }
  ],
  cursor: { batchSize: 4095 },
  allowDiskUse: true,
  $readPreference: { mode: "secondaryPreferred" },
  $db: "storesdb"
}

Figure 6: Pipeline query construction.

An exception to this is the ability to directly describe a MapReduce operation (MongoDB, Couchbase, HBase). The MapReduce paradigm fits this vision as a background workhorse to pre-process data, e.g., creating secondary indexes or materialized views, to endow NoSQL data stores with the sorting, grouping or aggregation operations required for approximating SQL support. Some data stores provide this functionality automatically, while others require user intervention. In any case, as long as the results of a MapReduce job can be converted to a nested tabular format, these can be consumed in the context of the relational model. When not supported by the data store, some SQL features can be implemented with client-side processing or a middleware solution. For example, projection can be fully supported in Redis with client-side filtering before returning data to the platform user. A second exception is needed to take advantage of specific NoSQL features. For example, to take advantage of the added querying features the graph model brings to Neo4j, more expertise than intended for using a low-code platform may be required. Still, this can be supported by enabling advanced users to edit the query itself. On a different note, Redis exposes its data as simple data structures. It might make sense to allow the user to compose several of these structures into a materialized view, providing, in fact, an external schema for the stored data.
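The equi-JOIN use of MapReduce described in Section 3.2, grouping by the join key and generating all cross-side matches in the reducer, can be made concrete with a minimal, self-contained Python sketch; the collections and fields below are illustrative, not from the paper's dataset:

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    # Map phase: each record yields zero or more (key, value) pairs.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    # Reduce phase: an arbitrary function processes each key's values.
    out = []
    for key, values in groups.items():
        out.extend(reduce_fn(key, values))
    return out

# Two illustrative "collections" to be equi-joined on "sid".
stores = [{"sid": 1, "location": "Braga"}, {"sid": 2, "location": "Porto"}]
sales = [{"sid": 1, "widget": "red"}, {"sid": 1, "widget": "blue"}]

def map_fn(record):
    side, row = record
    yield row["sid"], (side, row)        # group by the join key

def reduce_fn(key, values):
    left = [row for side, row in values if side == "L"]
    right = [row for side, row in values if side == "R"]
    for l in left:                       # generate all cross-side matches
        for r in right:
            yield {**l, **r}

records = [("L", s) for s in stores] + [("R", s) for s in sales]
joined = map_reduce(records, map_fn, reduce_fn)
```

The output is a flat list of joined rows, which is exactly the tabular (or easily tabularized) form that lets MapReduce results be consumed within the relational model.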

To summarize the expected development effort:

1. There is no need to modify (any of these) NoSQL data stores for integration with the OutSystems platform.

2. Client-side or in-middleware computation should be limited to: (a) converting results to a tabular format, when needed; (b) filling in for operators that are either missing from the data store or cannot be used ad hoc, when needed; and (c) converting SQL to native query languages or SQL-like dialects.

3. Add support for nesting and unnesting operations to the OutSystems platform (see Figure 4).
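Item 2(b), filling in for missing operators, can be illustrated with a sketch of client-side projection and grouping over a key-value store in the style of Redis; the keys, records and helper names below are our own illustration:

```python
from collections import defaultdict

# Stand-in for a key-value store (e.g. Redis) that can only return whole
# values, by key or by key pattern.
kv = {
    "store:1": {"location": "Braga", "sales": 10},
    "store:2": {"location": "Porto", "sales": 7},
    "store:3": {"location": "Braga", "sales": 5},
}

def scan(store, prefix):
    # The only "query" the store itself offers: scan keys by pattern.
    return [v for k, v in store.items() if k.startswith(prefix)]

def project(rows, attrs):
    # Projection filled in client-side, before rows reach the platform.
    return [{a: r[a] for a in attrs} for r in rows]

def group_sum(rows, key, value):
    # Grouping and aggregation filled in client-side as well.
    acc = defaultdict(int)
    for r in rows:
        acc[r[key]] += r[value]
    return [{key: k, "sum": v} for k, v in acc.items()]

rows = project(scan(kv, "store:"), ["location", "sales"])
totals = group_sum(rows, "location", "sales")
```

From the platform user's perspective, the result is indistinguishable from a store-side GROUP BY; the cost is that all matching data must cross the wire first.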

4 Schema Discovery

Offering a uniform view of unstructured data requires data types to be well defined and the definition of a schema to which the data must conform. A schema is a formal description that exposes the underlying structure of data.

Currently, for integration with a data source, the platform must be able to list available databases, list tables per database and retrieve a schema for the data. Providing a schema to the platform is key to the low-code approach for data integration, as presenting the schema to the user, along with data samples, is required for queries to be defined visually.

For relational databases, the schema consists of relations or entities (typically represented as tables), their attributes (including data types) and, potentially, the cardinality associated with the referential relationships between them. For NoSQL databases based on a document model, using MongoDB as an example, the schema for a database should include existing collections, document fields for each collection (including data types) and, potentially, referential relationships. However, while relational schemas are flat, strict and deterministic, the flexibility of adding documents with different fields to the same collection, having fields with the same name holding values of different types, and the ability to define nested structures make it harder to elicit a complete and accurate schema of the data.

4.1 Third-party tools

Figure 8 summarizes the schema-related metadata that can be retrieved or inferred from each of the analysed data stores, using currently available tools.



                MapReduce                     Pipeline                    Tree
MongoDB         db.collection.mapReduce()     db.collection.aggregate()   no
Couchbase       "views"                       no                          N1QL
Redis           no                            no                          no
Cassandra       no                            no                          CQL
HBase           co-processors (no grouping)   no                          no
DynamoDB        no                            execute, then filter        no
Elasticsearch   no                            "aggregations"              no
Neo4j           no                            no                          Cypher QL
Hive/Cloudera   no                            no                          SQL

Figure 7: Query construction paradigms in NoSQL data stores.

                list databases   list tables/collections   retrievable schema
MongoDB         yes              yes                       inferred, probabilistic
Couchbase       yes              yes                       inferred, probabilistic
Redis           yes              yes (keys)                generally N/A
DynamoDB        N/A              yes                       explicit, partial
Cassandra       yes              yes                       explicit
HBase           N/A              yes                       no
Hive/Cloudera   yes              yes                       explicit
Elasticsearch   N/A              yes (indexes)             inferred, partial
Neo4j           N/A              yes (nodes/rels.)         inferred, partial

Figure 8: Retrievable schema-related metadata from each NoSQL data store.

Cells marked N/A indicate that there is no comparable concept in the store's data model. Partial schema inference refers to cases where nested structures are opaque.

A wide range of NoSQL data stores explicitly store and provide metadata, a confirmation of the JSON data model, which allows nested data collections, as a common trend for data structuring and representation. There are, however, minor differences in which data types can be attached to values and even how much nesting is allowed.

Second, NoSQL data stores vary widely in which metadata can be obtained. On one end, systems such as Cassandra, Hive, and Cloudera Impala store and enforce a user-defined schema that unambiguously describes data. On the other end, systems such as HBase do not provide any support for setting or enforcing a schema. As they store just arbitrary byte sequences, the schema also cannot be easily inferred. They are, however, frequently used with third-party query mechanisms that provide the missing metadata. HBase, for example, is often used in the Hadoop stack with Hive. In between, systems such as MongoDB, Couchbase, Elasticsearch and Neo4j allow the schema to be inferred from currently stored data and even provide optional mechanisms to enforce it. For example, using MongoDB it is possible to define a set of rules over the attributes of documents belonging to a given collection. Validation is defined at the collection level, and conditions can require a document to contain a given set of attributes with a given type, set attribute-level boundary conditions, or require values to match a given regular expression. In effect, introducing validation limits the variability of the structure of JSON documents.
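The collection-level validation just described takes, in MongoDB, the form of a JSON-Schema-style validator document. The sketch below shows one such validator as a Python dict; the collection semantics and constraints are our own illustration:

```python
# Illustrative validator for a hypothetical "stores" collection: documents
# must have a string "location"; if "sales" is present it must be a
# non-negative integer. It would be attached at collection level, e.g. with
# db.createCollection("stores", {"validator": validator}) or via collMod.
validator = {
    "$jsonSchema": {
        "bsonType": "object",
        "required": ["location"],
        "properties": {
            "location": {"bsonType": "string"},
            "sales": {"bsonType": "int", "minimum": 0},
        },
    }
}
```

A validator like this reduces the structural variability that otherwise makes schema inference probabilistic.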

Mechanisms used to infer a schema from MongoDB and Couchbase can, in principle, be applied to other schema-less data stores, by adapting existing tools or by implementing similar ones. In short, tools such as mongodb-schema (https://github.com/mongodb-js/mongodb-schema) analyze a sample of documents stored in a given collection and provide, as outcome, a probabilistic schema. Fields are annotated with a probability according to how frequently they occur in that collection's sampled documents. Fields are also annotated with a set of data types; each data type is itself annotated with a probability according to the mapping's occurrence in the sampled documents. There is also mongodrdl (https://docs.mongodb.com/bi-connector/current/reference/mongodrdl/), a tool for inferring a relational schema from a MongoDB database or collection, which, however, in our experiments, fell short of accurately representing some relationships between unnested array elements and top-most attributes.
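The kind of probabilistic schema such tools produce can be approximated in a few lines. The sketch below computes, for each top-level field, its occurrence probability and the distribution of observed types; the sampled documents are illustrative:

```python
from collections import defaultdict

def infer_schema(docs):
    """Probabilistic schema over a sample of documents: for each top-level
    field, the fraction of documents containing it and the observed types,
    each with its own relative frequency."""
    field_count = defaultdict(int)
    field_types = defaultdict(lambda: defaultdict(int))
    for doc in docs:
        for field, value in doc.items():
            field_count[field] += 1
            field_types[field][type(value).__name__] += 1
    n = len(docs)
    return {
        f: {"probability": field_count[f] / n,
            "types": {t: c / field_count[f]
                      for t, c in field_types[f].items()}}
        for f in field_count
    }

sample = [
    {"_id": 1, "name": "a", "price": 10},
    {"_id": 2, "name": "b", "price": "10"},   # same field, different type
    {"_id": 3, "name": "c"},                  # field missing entirely
]
schema = infer_schema(sample)
```

Real tools go further (nested paths, arrays, sampling strategies), but the annotation model, fields and types weighted by observed frequency, is the same.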

A similar concern holds for using data sources with polyglot query engines, as these enable expressing data processing operations over systems that expose multiple native data models and query languages.

Dremio OSS performs schema inference, but treats nested structures as opaque and, therefore, does not completely support low-code construction of unnesting operations, in the sense that the user still needs to explicitly handle these. Still, it provides the ability to impose a table schema ad hoc or flexibly adapt data types, which is a desirable feature for overriding incorrect schema inference.

With PostgreSQL FDW, it is possible to declare tables for which query and manipulation operations are delegated to adapters. The wrapper interface includes the ability to either impose or import a schema for the foreign tables. Imposing a schema requires the user to declare data types and structure, and it is up to the wrapper to make it fit, using automatic type conversions where possible. If this automatic process is not successful, the user will need to change the specified data type to provide a closer type match. The wrapper can also (optionally) advertise the possibility of importing a schema. In this case, the user simply instructs PostgreSQL to import metadata from the wrapper and use it for further operations. This capability is provided by the wrapper and, currently, is only supported for SQL databases, for which the schema can be easily queried. Furthermore, PostgreSQL FDW can export the schema of the created foreign tables. Both for Dremio and PostgreSQL, limitations in schema imposition/inference do not impact querying capabilities, only the required talent to use the system. For PostgreSQL FDW, this can be mitigated by extending adapters to improve support for nested data structures, integrating schema inference/extraction techniques, as proposed and described in Section 5.1.

4.2 Improving current tools

Here, we briefly consider research results for which runnable systems are not generally available, but which nonetheless contribute relevant ideas and techniques. Generic schema inference/discovery focuses on discovering, for a set of semi-structured data, how it can be generically represented as sets of attributes (entities) of a given type (or set thereof), optionally identifying relationships between entities (e.g., references) and constraints (e.g., that a given attribute is required to be non-null). Proposals are typically motivated by the necessity of designing client applications that can take advantage of a semi-structured data source for which a schema is unknown, with earlier work focused mainly on XML data. One approach is to involve the user in schema discovery by exposing generated relational views of the JSON data to users, so that they can help cluster records as collections and mark or validate relationships between entities [22]. While the goal of this particular work is to ultimately provide a flat relational schema of the JSON data, concerns such as minimising the number of collections in the final schema and introducing relationships between entities might have a significant impact on the effectiveness of querying JSON data, an aspect that is not assessed by the authors. This type of approach seems to be better suited for data exploration than for integrating data sources with applications in a low-code setting.

It has also been proposed that machine learning can be used to generate relational schemas from semi-structured data [6]. However, while the authors did perform a preliminary assessment of query performance on the generated schemas, queries were performed on data loaded onto a relational database. Results from this assessment do not necessarily hold when queries (operations) are pushed down to a NoSQL store.

A commonly identified pattern is that the same conceptual object can be represented by documents that differ in a subset of fields, or have fields with different types, as a consequence of the schema-less nature of semi-structured data. This effect can be captured by coalescing these slightly different effective schemas as different versions of the same document schema. In [18], the authors propose a method based on model-driven engineering to do just that.

A significantly different approach to schema discovery is to analyse application source code to discover the schema implicitly imposed by the application on schema-less data sources. In [4], the authors propose such a method for applications that use relational and NoSQL data sources. While currently out of scope, it might be interesting to offer this type of capability to ease the migration of applications to the OutSystems platform.

5 Architecture

Considering the conclusions from surveying a variety of NoSQL systems, in particular regarding their supported data and query models, we describe the architecture for a polyglot data access layer for a low-code application platform, and then discuss a proof-of-concept implementation based on existing open source components.

Our proposal is based on two main criteria: first, how it contributes to the vision of NoSQL data integration in the low-code platform outlined in Section 1, and how it fits the low-code approach in general; second, the talent and effort needed for developing such integrations and then, later, for each additional NoSQL system that needs to be supported.

We can consider two extreme views. On the one hand, we can enrich the abstractions that are exposed to the developer to encompass the data and query processing models. This includes: data types and structures, such as nested tuples, arrays, and maps; query operations, ranging from general-purpose data manipulation (e.g., flattening a nested structure) to domain-specific operations (e.g., regarding search terms in a text index); and, finally, where applicable, query composition (e.g., with MapReduce or a pipeline).

This approach has, however, several drawbacks. First, it pollutes the low-code platform with a variety of abstractions that have to be learned by developers to fully use it. Moreover, these abstractions change with support for additional NoSQL systems and are not universally applicable. In fact, support for different NoSQL systems would be very different, making it difficult to use the same know-how to develop applications on them all. Finally, building and maintaining the platform itself would require substantial talent and effort in the long term, as support for additional systems could not be neatly separated into plugins with simple, abstract interfaces.

On the other hand, we can map all data in different NoSQL systems to a relational schema with standard types and allow queries to be expressed in SQL. This results in a mediator/wrapper architecture that allows the same queries to be executed over all data regardless of its source, even if executed by the query engine at the mediator layer.

This approach also has drawbacks. First, mapping NoSQL data models to a relational schema requires developer intervention to extract the view that is adequate for the queries that are foreseen. This will most likely require NoSQL-specific talent to write target queries and conversion scripts. Moreover, query capabilities in NoSQL systems would remain largely unused, as only simple filters and projections are pushed down, meaning the bulk of data processing would need to be performed client-side.

Our proposal is a compromise between these two extreme approaches, which can be summed up as follows: support for nested data and its manipulation in the abstractions shown to the low-code developer, along with the ability to push aggregation operations down to NoSQL stores from a mediator query engine, will account for the vast majority of use cases. In addition, the ability to embed native query fragments in queries will allow fully using the NoSQL store when talent is available, without disrupting the overall integration. The result is a polyglot query engine, where SQL statements are combined with multiple foreign languages for different NoSQL systems.

The proposed architecture is summarized in Figure 9, highlighting the proposed NoSQL data access layer. To the existing OutSystems platform, encompassing development tools and runtime components, we add a new Polyglot connector, using the Database Integration API to connect to the Polyglot Query Engine (QE) through standard platform APIs. The Polyglot QE acts as a mediator. It exposes an extended relational database schema for connected NoSQL stores and is able to handle SQL and polyglot queries.

Figure 9: Architecture overview.

For each NoSQL store, there is a Wrapper, composed of three sub-components: metadata extraction, responsible for determining the structure of data in the corresponding store using an appropriate method and mapping it to the extended SQL data model of the Polyglot QE; query push-down, able to translate a subset of SQL query expressions, to relay native query fragments, or to produce a combination of both in a store-specific way; and, finally, the cursor, able to iterate over result data and to translate and convert it as required to fit the common extended SQL data model.
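The three sub-components can be captured in a small interface. The class and method names below are our own illustration of the described responsibilities, not the platform's actual API:

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterator, List, Optional, Tuple

class Wrapper(ABC):
    """Per-store wrapper: metadata extraction, query push-down and a cursor."""

    @abstractmethod
    def extract_metadata(self) -> Dict[str, List[Tuple[str, str]]]:
        """Map the store's structure to the extended SQL data model,
        e.g. table name -> [(column name, column type)]."""

    @abstractmethod
    def push_down(self, sql_fragment: str,
                  native_fragment: Optional[str] = None) -> Any:
        """Translate a subset of SQL, relay a native query fragment, or
        combine both in a store-specific way, returning an execution plan."""

    @abstractmethod
    def cursor(self, plan: Any) -> Iterator[Dict[str, Any]]:
        """Iterate over result data, converting each row as required to fit
        the common extended SQL data model."""
```

Each supported NoSQL system would then be a concrete subclass, keeping store-specific logic neatly separated behind one abstract interface.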

The Polyglot QE makes use of Local storage for the configuration of NoSQL store adapters and for holding materialized views of data to improve response times. The Job Scheduler enables periodically refreshing materialized views by re-executing their corresponding queries.

5.1 Implementation

We base our proof-of-concept implementation on open source components. In this section, we start by describing how we selected those components and then describe the additional development needed to make them fit the proposed architecture.

5.1.1 Component selection

The main component to select is the SQL query engine used as the mediator. Besides its features as a query engine, we focus on: the availability of wrappers for different NoSQL systems and the talent needed to implement additional features; the compatibility of the open source license with commercial distribution; the maturity of the code-base and supporting open source community; and, finally, its compatibility with the OutSystems low-code platform. We consider two options.

PostgreSQL with FDW [17]. PostgreSQL is an option as it supports foreign data wrappers according to the SQL/MED standard (ISO/IEC 9075-9:2008). The main attraction of PostgreSQL is that it is a very mature open source product, with a business-friendly license, a long history of deployment in production, and an unparalleled developer and user community. There is also support for .NET and Java client application platforms. In terms of features, PostgreSQL provides a robust optimizer and an efficient query engine, which has recently added parallel execution, with excellent support for SQL standards and multiple useful extensions. It supports nested data structures both with the json/jsonb data types and by natively supporting arrays and composite types, with extensive support for traversing and unnesting them. Regarding support for foreign data sources, besides simple filters and projections, the PostgreSQL Foreign Data Wrapper (FDW) interface can interact with the optimizer to push down joins and post-join operations such as aggregations. With PostgreSQL FDW, it is possible to declare tables for which query and manipulation operations are delegated to adapters. The wrapper interface includes the ability to either impose or import a schema for the foreign tables. Imposing a schema requires the user to declare data types and structure, and it is up to the wrapper to make it fit, using automatic type conversions where possible. If this automatic process is not successful, the user will need to change the specified data type to provide a closer type match. The wrapper can also (optionally) advertise the possibility of importing a schema. In this case, the user simply instructs PostgreSQL to import metadata from the wrapper and use it for further operations. This capability is provided by the wrapper and, currently, is only supported for SQL databases, for which the schema can be easily queried. Furthermore, PostgreSQL FDW can export the schema of the created foreign tables. In addition to already existing wrappers for many NoSQL data sources, with variable features and maturity, the Multicorn framework (https://github.com/Kozea/Multicorn) allows exposing the Python scripting language to the developer, to complement SQL and express NoSQL data manipulation operations.
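As an illustration of the talent involved, a Multicorn wrapper is a small Python class whose execute() method receives the query's filters (quals) and projected columns. The sketch below follows Multicorn's documented ForeignDataWrapper shape, but stubs the base class and backs the table with an in-memory list so it stays self-contained; the wrapper class, collection and fields are our own illustration:

```python
try:
    from multicorn import ForeignDataWrapper  # available inside PostgreSQL
except ImportError:
    class ForeignDataWrapper:                 # stub so the sketch runs standalone
        def __init__(self, options, columns):
            self.options, self.columns = options, columns

# Illustrative in-memory stand-in for a NoSQL collection.
FAKE_COLLECTION = [
    {"id": 1, "location": "Braga"},
    {"id": 2, "location": "Porto"},
]

class DemoNoSQLWrapper(ForeignDataWrapper):
    """Foreign table over a NoSQL collection: PostgreSQL calls execute()
    with the query's quals (filter conditions) and projected column names."""

    def __init__(self, options, columns):
        super().__init__(options, columns)
        self.columns = columns

    def execute(self, quals, columns):
        for doc in FAKE_COLLECTION:
            # Push simple equality filters down into the scan.
            if all(doc.get(q.field_name) == q.value
                   for q in quals if q.operator == "="):
                # Project only the requested columns.
                yield {c: doc.get(c) for c in columns}
```

A production wrapper would replace FAKE_COLLECTION with a NoSQL client call and translate quals into the store's native filter syntax, which is exactly the query push-down sub-component described above.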

In terms of our goals, PostgreSQL falls short on automatically using existing materialized views in queries. The common workaround is to design queries based on views and later decide whether to materialize them, which is usable in our scenario. Another issue is that schema inference is currently offered for relational data sources only. The workaround is for the developer to explicitly provide the foreign table definition.

Calcite [1] (in Dremio OSS [7]). The Calcite SQL compiler, featuring an extensible optimizer, is used in a variety of modern data processing systems. We focus on Dremio OSS as its feature list most closely matches our goal. Calcite is designed from scratch for data integration and focuses on the ability to use the optimizer itself to translate parts of the query plan to different back-end languages and APIs. It also supports nested data types and corresponding operators. Dremio OSS performs schema inference, but treats nested structures as opaque and, therefore, does not completely support low-code construction of unnesting operations, in the sense that the user still needs to explicitly handle these. Still, it provides the ability to impose a table schema ad hoc or flexibly adapt data types, which is a desirable feature for overriding incorrect schema inference. Also, Dremio OSS adds a distributed parallel execution engine, based on the Arrow columnar format, and a convenient way to manage materialized views (a.k.a. "reflections") that are automatically used in queries. Unfortunately, one cannot define or use indexes on these views, which reduces their usefulness in our target application scenarios.

Although Calcite has a growing user and developer community, its maturity is still far behind PostgreSQL's. The variety of adapters for different NoSQL systems is also lagging behind PostgreSQL FDW, although some are highly developed. For instance, the MongoDB adapter in Dremio OSS is able to extensively translate SQL queries to MongoDB's aggregation pipeline syntax, thus being able to push down much of the computation and reduce data transfer. The talent and effort needed for exploiting this in additional data wrappers is, however, substantial. Finally, the main drawback of this option is that, as we observed in preliminary tests, resource usage and response time for simple queries are much higher than for PostgreSQL.

Choosing PostgreSQL with FDW. In the end, we found that our focus on interactive operational applications and the maturity of the PostgreSQL option outweigh, for now, the potential advantages of Calcite's extensibility.

Additional development. Completing a proof-of-concept implementation based on PostgreSQL as a mediator requires additional development in the low-code platform itself, an external database connector, and the wrappers. As examples, we describe support for two NoSQL systems. The first is Cassandra,



a distributed key-value store that has evolved to include a typed schema and secondary indexes. It has, however, only minimal ad-hoc query processing capabilities, restricted to filtering and projection. The second is MongoDB, a schema-less document store that has evolved to support complex query processing with either MapReduce or the aggregation pipeline. Both are also widely used in a variety of applications.

Schema conversion. In order to support relational schema introspection, we reuse mongodb-schema4, extending it to provide a probabilistic schema, with fields and types, for each collection in a MongoDB database. Top-level document fields are mapped as table attributes. When based on probabilistic schemas, all discovered attributes are included, leaving it to the user/developer to decide which attributes to consider. Nested documents’ fields are mapped as top-level attributes, named as the field prefixed with its original path. Nested arrays are handled by creating a new table and promoting fields of inner documents to top-level attributes. Each document from a given collection becomes a row of the corresponding table (or tables). An alternative would be to create a denormalized table, as shown in Table 3. Notice that this is equivalent to the result of a natural join between the corresponding separate tables. However, separate tables better fit what would be expected from a relational database and thus improve the low-code experience. It should be pointed out that viewing the original collection as a set of separate relational tables has no impact on the performance of a query with a join between these tables. The required unnesting directives, using the $unwind pipeline aggregation operator, are also generated and added to the table definition. We also provide the option, on by default, of adding a column referencing the _id of the outermost table to all inner tables on schema generation, which can serve as an elementary foreign key.

MongoDB wrapper. There are multiple FDW implementations for MongoDB. We selected one based on Multicorn5, for ease of prototyping, and changed it extensively to include schema introspection

4https://github.com/mongodb-js/mongodb-schema
5https://github.com/asya999/yam_fdw

and, taking advantage of the aggregation pipeline query syntax, to allow push-down to work with user-supplied queries. This is greatly eased by MongoDB’s aggregation pipeline syntax being easily manipulated by programs, simply by adding additional stages.

Cassandra wrapper. We also use a wrapper based on Multicorn.6 In this case, we add the ability to use arbitrary Python expressions to compute row keys from arbitrary attributes, as in earlier versions of Cassandra it was usual to manually concatenate several columns. Even if this is no longer necessary in recent versions of Cassandra, it is still common practice in other NoSQL systems such as HBase. The currently preferred interface to Cassandra, CQL, is not the best fit for manipulation by programs, although, being so simple, it can be done with a relatively small amount of text parsing.

Connectors. We implemented custom connectors for each NoSQL store based on the original PostgreSQL connector. This allows the developer to directly pick the target data store from the platform’s visual development environment [15] drop-down menu and provide system-specific connection options. It also allows system-specific projection and aggregation operators to be handled.

Developer platform. The changes needed in the platform to fully accommodate the integration are the ability to express nesting and unnesting operators in the data manipulation UI, and to generate SQL queries that contain them when using the NoSQL integration connectors. It is, however, possible to work around this by configuring multiple flattened views of data, as needed, when the schema is introspected and imported.
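As an illustration of the schema-conversion rules above, the following sketch (our own simplification, not the actual mongodb-schema-based implementation; all names are ours) splits a document containing a nested array into an outer table plus an inner table whose rows reference the outer document's _id:

```python
# Illustrative sketch of the nested-array mapping rule: top-level fields
# become columns of an outer table; each nested array becomes a separate
# inner table whose rows carry a reference to the outer _id, serving as
# an elementary foreign key. (Hypothetical code, not the real wrapper.)

def map_document(doc, collection):
    """Split one MongoDB document into rows for the derived tables."""
    # Non-array fields stay in the outer table.
    outer_row = {k: v for k, v in doc.items() if not isinstance(v, list)}
    tables = {collection: [outer_row]}
    for field, value in doc.items():
        if isinstance(value, list):  # nested array -> new inner table
            inner = []
            for item in value:
                row = dict(item)
                row["_parent_id"] = doc["_id"]  # reference to outer _id
                inner.append(row)
            tables[collection + "_" + field] = inner
    return tables

doc = {"_id": 7, "C_ID": 1, "ORDERS": [{"O_ID": 1}, {"O_ID": 2}]}
tables = map_document(doc, "CUSTOMER")
# tables["CUSTOMER"] holds one row; tables["CUSTOMER_ORDERS"] holds two,
# each referencing _id 7 of the parent customer.
```

Joining `CUSTOMER` with `CUSTOMER_ORDERS` on `_parent_id` then reproduces the denormalized view, matching the natural-join equivalence noted above.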

6 Use case

The proposed architecture and proof-of-concept integration of the OutSystems low-code platform with NoSQL data stores is evaluated by applying it to a simple application using multiple NoSQL stores.

6https://github.com/rankactive/cassandra-fdw



6.1 Application and workload

We use a subset of the TPC-C [24] database schema and workload as our application scenario. Note that we don’t aim at evaluating database system performance and don’t expect the results obtained here to be valid as such. The reasons for using TPC-C are the following: First, the database schema and operations are very well known by the database research community and in industry, making it easier to follow and understand the examples used. Second, and most importantly, we make use of py-tpcc,7 an implementation of the TPC-C database schema and workload for multiple NoSQL (and some SQL) systems that exploits the native idioms of each of them. This provides us with a Rosetta stone that we can use to compare the results obtained with our proposal. In particular, py-tpcc denormalizes data as fit for NoSQL systems.

Briefly, TPC-C simulates a generic wholesale supplier business, with multiple geographically distributed warehouses and associated sales districts. All warehouses stock the same set of items and each district has a corresponding set of customers. We consider only two out of the five transactions:

Stock-Level Determines the number of recently sold items (i.e., in the latest 20 orders) with stock level below a user-provided threshold.

Order-Status Queries the delivery status for each of the items in a customer’s last order, where the customer is specified either by its primary key identifier or by its name.

Therefore, from the nine tables in the complete TPC-C database schema, we use only CUSTOMER, ORDERS, ORDER-LINE, DISTRICT, and STOCK. We do, however, create, populate, and map the entire schema.

6.2 Experimental setup

The implementation of TPC-C on Cassandra in py-tpcc closely follows the relational database schema in

7https://github.com/apavlo/py-tpcc

the original specification. In particular, it uses a separate column family with the corresponding columns for each TPC-C table. However, it has two complications: First, all data is stored as UTF-8 strings instead of using the actual data types. Second, the key for each column family is manually assembled by the application code by concatenating the relevant column values and padding them with zeros. These complications make it harder to map to a relational schema. However, they reflect current practice given the limitations of older versions of Cassandra (i.e., before 3.0) and thus make an excellent use case. The same practices are widely used in other key-value stores, where different application-specific methods are used to encode keys.
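For illustration, the DISTRICT row key in py-tpcc is assembled by exactly the kind of zero-padded concatenation described above, matching the expression later imposed on the foreign table in Figure 10 (the function name is ours):

```python
# How py-tpcc-style application code assembles a Cassandra row key for
# DISTRICT: each integer column is rendered as a zero-padded string and
# the pieces are concatenated. The wrapper must reproduce this encoding
# to turn SQL equality filters into a direct key lookup.

def district_key(d_id, d_w_id):
    return str(d_id).zfill(5) + str(d_w_id).zfill(5)

key = district_key(1, 1)  # -> '0000100001'
```

This is the same '0000100001' key that appears in the pushed-down CQL lookup of Figure 12.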

The implementation of TPC-C on MongoDB in py-tpcc is a good use case as it does not directly map each TPC-C table to a separate document collection. Instead, it maps the ORDERS for a given client as an array nested in each CUSTOMERS document, and ORDER-LINE as another array nested in each of the ORDERS. HISTORY is also nested in CUSTOMERS, although we never read or update it in the Stock-Level and Order-Status transactions. Other tables are directly mapped to separate document collections, of which we use DISTRICT and STOCK. Moreover, py-tpcc also stores all data as UTF-8 strings instead of making use of the various data types supported by MongoDB.

Our experimental setup is completed with a new implementation of py-tpcc for PostgreSQL, derived from the SQLite implementation with minimal modifications for connection setup and parameter passing. We use it to exercise the mapping of Cassandra and MongoDB through a wrapper.

6.3 Importing the schema

We start by importing the schema for both Cassandra and MongoDB using the IMPORT FOREIGN SCHEMA statement. This provides a basic mapping that can now be refined.

For Cassandra, the main issue is the use of a composite key that is created by the application by concatenating strings. As Figure 10 shows, this can be solved by using an embedded Python snippet that



is copied directly from the original application and that generates the key for direct lookup when both columns are available. A secondary issue is that converting between integers and strings at the query engine level prevents filters from being pushed down, as a reverse conversion would need to be assumed. This is solved by changing the type of the columns to small integers that are then transparently converted to strings within the wrapper.
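A minimal sketch of why the conversion lives in the wrapper: with the columns typed as small integers, PostgreSQL pushes the equality quals down unchanged, and the wrapper alone re-encodes them into the stored string key. This is an illustrative simplification (function and qual shape are ours), not the actual Multicorn wrapper code:

```python
# Hypothetical sketch of wrapper-side qual handling: pushed-down quals
# arrive as (column, operator, value) triples; when both key components
# are constrained by equality, the wrapper rebuilds the composite string
# key, enabling a direct Cassandra lookup instead of a full scan.

def quals_to_key(quals):
    """Build the Cassandra row key from pushed-down equality quals."""
    by_col = {col: value for col, op, value in quals if op == "="}
    if {"d_id", "d_w_id"} <= by_col.keys():
        return str(by_col["d_id"]).zfill(5) + str(by_col["d_w_id"]).zfill(5)
    return None  # fall back to a scan, filtering at the query engine

key = quals_to_key([("d_w_id", "=", 1), ("d_id", "=", 1)])
```

If either column is missing from the quals, the wrapper cannot form the key and must let the engine filter, which is precisely the push-down opportunity the type change preserves.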

For MongoDB, the schema importer will generate, on request, the table definition in Figure 11 for table ORDERS, nested in documents in the CUSTOMER collection of the tpcc database, producing a usable mapping. mname options reference the document fields in the backing MongoDB collection, CUSTOMER, referenced in the table options. Columns whose mname does not reference an ORDERS field have been added by the developer using ALTER TABLE. This method can be used if, for example, some columns don’t have the expected names (e.g., c_id instead of o_c_id) or to add foreign keys. However, as we intend to use existing application code that assumes the SQL schema for TPC-C, we can override the definition with different names, as shown in Figure 11. Adding such mappings does not require any processing, as the backing data is not changed.

6.4 Query execution

The plan for the first query of the Stock-Level transaction on Cassandra is shown in Figure 12. It can be seen that the filter is successfully pushed down to the Multicorn wrapper and then transformed into a filter on the key in the CQL query. Figure 13 shows the plan for a query in the Order-Status transaction running on the MongoDB data store. It shows how projections and selections are pushed down and combined with the original pipeline query used when defining the foreign table in Figure 11. In particular, analysing the plan from the bottom up shows that the optimizer is capable of realizing that the columns referred to in the WHERE clause of the original query as fields of nested documents are also fields in the outer documents, enabling the filter equivalent to the WHERE clause to be pushed down as the first operation to be performed, without requiring any previous $unwind operations, thus making this query very efficient. Also, the last $project operation to be performed (which appears first in the listing) selects only the fields that match the specification of the SELECT statement to be returned, thus limiting data transfer to the strictly necessary.
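Combining pushed-down operators with the pipe option of the foreign table definition is conceptually simple because an aggregation pipeline is just a list of stages that a program can extend. A hedged sketch (function and parameter names are ours, not the wrapper's actual API) of how generated $match and $project stages could be combined with the user-supplied pipeline:

```python
# Sketch of combining machine-generated stages with the hand-written
# pipeline given in the foreign table's pipe option (see Figure 11).
# A pushed-down filter is prepended so it runs before any $unwind; a
# generated $project is appended to trim the result to the SELECT list.
import json

def build_pipeline(user_pipe_json, match=None, project=None):
    pipeline = []
    if match:  # pushed-down filter runs first
        pipeline.append({"$match": match})
    pipeline += json.loads(user_pipe_json)  # developer-written stages
    if project:  # return only the SELECTed columns
        pipeline.append({"$project": project})
    return pipeline

pipe = build_pipeline('[{"$unwind": "$ORDERS"}]',
                      match={"C_W_ID": {"$eq": 1.0}},
                      project={"o_id": True})
```

The resulting stage list mirrors the shape of the pushed-down pipeline in Figure 13: an initial $match on outer-document fields, the user's $unwind, and a trailing $project.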

7 Related work

Increasing usage of multiple SQL and NoSQL database systems in the same organization, and in particular by the same application, has sparked an interest in tools and techniques that make this simpler. Moreover, it is increasingly important to enable queries across different database systems, combining data from different sources.

Hadoop MapReduce and Hive support multiple file formats and data sources, such as NoSQL data stores, by means of pluggable input format adapters. These adapters may discover the schema, including partitioning, or impose a user-defined one where not available, and then iterate over data. The Hadoop input format interface allows projecting and filtering to be pushed down, thus reducing the amount of data moved. The same approach is used by in-memory MapReduce engines such as Spark [2] and Flink, and also by parallel-distributed query engines such as Presto and Impala, to interface with multiple file formats and data sources. As these systems focus on large-scale queries by doing as much of the query computation as possible themselves, such that it can be optimized and parallelized, the ability to exploit particular querying capabilities and operators of the data stores is minimal. Moreover, even if the minimal query latency of parallel-distributed query engines is much smaller than that of MapReduce, it still isn’t adequate for operational workloads.

The alternative is to consider a federated database system with multiple SQL and NoSQL data stores. The SQL/MED standard has support for pushing down parts of queries for remote execution [14]. However, there is no single query language that can adequately express queries over such a diversity of processing systems, i.e., fully exploit their unique characteristics, and yet these characteristics are pre-



ALTER FOREIGN TABLE cass.district
  ALTER COLUMN key OPTIONS (composite
    'd_id,d_w_id:str(d_id).zfill(5)+str(d_w_id).zfill(5)'),
  ALTER COLUMN d_id TYPE SMALLINT,
  ALTER COLUMN d_w_id TYPE SMALLINT;

Figure 10: Coping with type conversions and a composite column in Cassandra.

DROP FOREIGN TABLE ymdb.orders;
CREATE FOREIGN TABLE ymdb.orders (
  o_all_local numeric OPTIONS (mname 'ORDERS.O_ALL_LOCAL'),
  o_carrier_id numeric OPTIONS (mname 'ORDERS.O_CARRIER_ID'),
  o_entry_d timestamp without time zone OPTIONS (mname 'ORDERS.O_ENTRY_D'),
  o_id numeric OPTIONS (mname 'ORDERS.O_ID'),
  o_ol_cnt numeric OPTIONS (mname 'ORDERS.O_OL_CNT'),
  o_c_id numeric OPTIONS (mname 'C_ID'),
  o_d_id numeric OPTIONS (mname 'C_D_ID'),
  o_w_id numeric OPTIONS (mname 'C_W_ID')
)
SERVER ymdbserver
OPTIONS (collection 'CUSTOMER', db 'tpcc', host 'mongo',
         auth_db 'admin', user '...', password '...',
         pipe '[{"$unwind": "$ORDERS"}]');

Figure 11: Unnesting and renaming columns from MongoDB.

cisely the reason these are used [Sto15]. Recent research addressing these challenges has led to polyglot database systems, also known as polystores.

A possible approach is to start with a high-level SQL query and then compile parts of it to low-level queries that can be pushed to the data stores. The remainder of the query is executed by a traditional query engine, which is able to provide additional operators and to combine data from multiple sources. This has been proposed for distributed file systems, using MapReduce [5]. The same approach is also used by SpliceMachine, which uses the Apache Derby query engine and translates parts of a query to HBase co-processors.8 More recently, it has been proposed that a full-fledged query language for nested [21] or relational [3] data can to a large extent be translated to a variety of target NoSQL data stores with various data and query models. This uses an optimizer to rewrite parts of the query, driven by cost estimates, while the remainder is executed by a general-purpose query engine. The Calcite open source system provides an implementation that is being used, for instance, in Dremio9, which provides columnar distributed-parallel query execution and materialization. F1 Query [20] is a federated query processing platform that executes SQL queries against data stored in different file-based formats as well as different storage systems at Google. These approaches do not, however, exploit native query processing capabilities when these expose specialized indexing and query operators. They also don’t make it easy to circumvent the translation layer to access native queries. Moreover, those that rely on MapReduce jobs increase end-to-end response time.

8https://www.splicemachine.com/product/how-it-works/

9https://github.com/dremio/dremio-oss



psql> EXPLAIN ANALYZE SELECT D_NEXT_O_ID FROM cass.DISTRICT WHERE D_W_ID = 1 AND D_ID = 1;
                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 Foreign Scan on DISTRICT (cost=20.00..100.00 rows=1 width=4) (actual time=3.149..3.174 rows=1 loops=1)
   Filter: ((D_W_ID = 1) AND (D_ID = 1))
   Multicorn: SELECT D_W_ID,D_ID,D_NEXT_O_ID,KEY FROM tpccks.district WHERE key = %s
   Multicorn: [bytearray(b'0000100001')]
   Multicorn: ['key']
 Planning time: 51.485 ms
 Execution time: 3.398 ms
(7 rows)

Figure 12: Example of filter push-down to Cassandra with a composite key.

psql> EXPLAIN ANALYZE SELECT OL_SUPPLY_W_ID, OL_I_ID, OL_QUANTITY, OL_AMOUNT, OL_DELIVERY_D
      FROM ymdb.ORDER_LINE WHERE OL_W_ID = '1' AND OL_D_ID = '1' AND OL_O_ID = '1';
                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------
 Foreign Scan on order_line (cost=20.00..57600000.00 rows=300000 width=136) (actual time=1.399..1.672 rows=9 loops=1)
   Filter: ((ol_w_id = '1'::numeric) AND (ol_d_id = '1'::numeric) AND (ol_o_id = '1'::numeric))
   Multicorn: {'$project': {'ol_amount': True, 'ol_delivery_d': True, 'ol_i_id': True, 'ol_supply_w_id': True,
                            'ol_quantity': True}}
   Multicorn: {'$project': {'ol_amount': '$ORDERS.ORDER_LINE.OL_AMOUNT', 'ol_w_id': '$C_W_ID',
                            'ol_delivery_d': '$ORDERS.ORDER_LINE.OL_DELIVERY_D', 'ol_o_id': '$ORDERS.O_ID',
                            'ol_d_id': '$C_D_ID', 'ol_i_id': '$ORDERS.ORDER_LINE.OL_I_ID',
                            'ol_supply_w_id': '$ORDERS.ORDER_LINE.OL_SUPPLY_W_ID',
                            'ol_quantity': '$ORDERS.ORDER_LINE.OL_QUANTITY'}}
   Multicorn: {'$unwind': '$ORDERS.ORDER_LINE'}
   Multicorn: {'$match': {'ORDERS.O_ID': {'$eq': 1.0}}}
   Multicorn: {'$unwind': '$ORDERS'}
   Multicorn: {'$match': {'C_D_ID': {'$eq': 1.0}, 'C_W_ID': {'$eq': 1.0}, 'ORDERS.O_ID': {'$eq': 1.0}}}
 Planning time: 28.739 ms
 Execution time: 1.818 ms
(10 rows)

Figure 13: Example of filter push-down to MongoDB on a nested table.

An alternative approach, which focuses on exploiting the specialized query capabilities of each NoSQL data store, is exemplified by BigDawg [8, 23], which admits that a system can be mapped to various data models and query languages instead of a single uniform language. This reduces the conceptual gap between generic languages and target data stores, however, still resulting in a wide variety of concepts to be introduced into the low-code platform. It has also been proposed that intermediate results can be built using a native query on some system and then transferred to a second system for a second computation step, using a different query system [12], suggesting that materialization can be an important tool in bridging the gap between different data and query models.

The CloudMdsQl approach [11] proposes an extension of SQL where parts of the query can be translated to native queries, while also allowing a query to embed ad hoc views defined using native query language snippets. This allows native query capabilities to be fully used when necessary. It also circumvents the need for introspection, as the schema is imposed by the query itself. The main advantage is that some of the query can be optimized globally and executed by the top-level query engine. For instance, it uses



BIND JOIN [9] to reduce data transfer across different data stores. This is the closest to our proposal, although we provide schema introspection, at least as a starting point for the developer, and use a standard SQL query engine.

8 Lessons Learned

We discussed the challenges in integrating a variety of NoSQL data stores with the OutSystems low-code platform. This is achieved by a SQL query engine that federates multiple NoSQL sources and complements their functionality, using PostgreSQL with Foreign Data Wrappers as a proof-of-concept implementation. It allowed us to learn some lessons about NoSQL systems and to propose a good trade-off between integration transparency and the ability to take full advantage of each system’s particularities. At the same time, it does not constrain the developer, who can override introspection and query generation by using embedded native query language snippets when required. In some cases, this doesn’t even constrain the ability of the query engine to globally optimize the query. The main lessons learned for a low-code platform provider are:

1. Target an extended relational model. The relational data model, when extended with nested composite data types such as maps and arrays, can successfully map the large majority of NoSQL data models with minimal conversion or conceptual overhead. Moreover, when combined with flatten and unflatten operators, the relational query model can actually operate on such data and represent a large share of target query operations. This is very relevant, as it provides a small set of additional concepts that have to be added to the low-code platform or, preferably, none at all, as unnesting is done when importing the schema.

2. A query engine is needed. Due to the varying nature of query capabilities in different data sources, a query engine that can perform various computations is necessary so that developers do not have to constantly mind these differences. This is true even when querying a single source at a time.

For polyglot developers, we underline the lessons from CloudMdsQl [11], with one notable exception:

3. Basic schema discovery with overrides is needed. Although CloudMdsQl [11] has shown that it is possible to build a polyglot query engine without schema discovery, by imposing ad-hoc schemas on native queries, this severely restricts its usefulness in the context of a low-code platform. However, after getting started with an automatically inferred schema, it is useful to allow the developer to impose additional structure, such as composite primary keys in key-value stores.

4. Embedded scripting is required. Although many data manipulation operations could be done in SQL at the query engine, embedding snippets of a general-purpose scripting language allows direct reuse of existing code and reduces the need for talent to translate them. Together with the ability to override automatic discovery, this is key to ensuring that the developer never hits a wall imposed by the platform.

5. Materialized view substitution is desirable. Although our proof-of-concept implementation does not include it, this is the main missing feature from the Calcite-based alternative. The ability to define different native queries as materializations of various sub-queries is the best way to encapsulate alternative access paths encoded in a data-store specific language.

6. Combining foreign tables with scripting is surprisingly effective. Although CloudMdsQl [11] proposed its own query engine, a standard SQL engine with federated query capabilities, when combined with a scripting layer for developing wrappers such as Multicorn, is surprisingly effective in expressing queries and supporting optimizations.

Finally, the main lessons learned for NoSQL data store providers are:

7. A NoSQL query interface should be targeted at machines, not only at humans. NoSQL systems such as MongoDB or Elasticsearch, which expose a query model based on an operator pipeline, are very friendly to integration as proposed. In detail, this allows generating native queries from SQL operators or combining partially hand-written code with generated code. Ironically, systems that expose a simplistic SQL-like language that is supposed to be more developer friendly, such as Cassandra, make it harder

21

Page 22: arXiv:2004.13495v1 [cs.DB] 28 Apr 2020 · 2020-04-29 · MongoDB can store relational data by introducing reference fields thatholdthevalueinthe_id ofthereferenceddocu-ment. Nonetheless,MongoDBencouragesdenormal-ization

to integrate as queries in these languages are not aseasily composed.8. Focus on combining query fragments. It

might be tempting to overlook some optimizationsthat are irrelevant when a human is writing a com-plete query, e.g., as pushing down $match in a Mon-goDB pipeline. However, these optimizations arefairly easy to achieve and greatly simplify combiningpartially machine generated queries with developerwritten queries.
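This kind of fragment combination can be sketched as follows, representing MongoDB-style aggregation pipelines as plain Python lists of stages; the function name is ours, not an API of any driver:

```python
def push_down_match(generated_match, developer_pipeline):
    """Prepend a machine-generated $match stage to a developer-written
    pipeline, merging it with a leading developer $match so the combined
    filter runs first and can still use an index."""
    if developer_pipeline and "$match" in developer_pipeline[0]:
        merged = {"$match": {"$and": [generated_match,
                                      developer_pipeline[0]["$match"]]}}
        return [merged] + developer_pipeline[1:]
    return [{"$match": generated_match}] + developer_pipeline

# A WHERE clause compiled from SQL, combined with a hand-written pipeline.
generated = {"country": "PT"}
handwritten = [{"$match": {"active": True}},
               {"$group": {"_id": "$city", "n": {"$sum": 1}}}]
combined = push_down_match(generated, handwritten)
# combined[0] == {"$match": {"$and": [{"country": "PT"}, {"active": True}]}}
```

Because pipelines are just data, the generated and hand-written fragments compose mechanically; this is exactly what a statement-based query language makes hard.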

References

[1] Apache Calcite – Documentation. https://calcite.apache.org/docs/, accessed: 2020-02-27

[2] Armbrust, M., Xin, R., Lian, C., Huai, Y., Liu, D., Bradley, J., Meng, X., Kaftan, T., Franklin, M., Ghodsi, A., Zaharia, M.: Spark SQL: Relational data processing in Spark. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. pp. 1383–1394. SIGMOD '15, ACM (2015). https://doi.org/10.1145/2723372.2742797

[3] Begoli, E., Camacho-Rodríguez, J., Hyde, J., Mior, M.J., Lemire, D.: Apache Calcite: A foundational framework for optimized query processing over heterogeneous data sources. In: Proceedings of the 2018 International Conference on Management of Data. pp. 221–230. SIGMOD '18, ACM, New York, NY, USA (2018). https://doi.org/10.1145/3183713.3190662

[4] Castrejón, J., Vargas-Solar, G., Collet, C., Lozano, R.: ExSchema: Discovering and maintaining schemas from polyglot persistence applications. In: 2013 29th IEEE International Conference on Software Maintenance. pp. 496–499. IEEE (2013)

[5] DeWitt, D., Halverson, A., Nehme, R., Shankar, S., Aguilar-Saborit, J., Avanes, A., Flasza, M., Gramling, J.: Split query processing in Polybase. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. pp. 1255–1266. SIGMOD '13, ACM, New York, NY, USA (2013). https://doi.org/10.1145/2463676.2463709

[6] DiScala, M., Abadi, D.J.: Automatic generation of normalized relational schemas from nested key-value data. In: Proceedings of the 2016 International Conference on Management of Data. pp. 295–310. SIGMOD '16, ACM (2016)

[7] Dremio – The data lake engine. https://docs.dremio.com, accessed: 2020-02-27

[8] Duggan, J., Elmore, A., Stonebraker, M., Balazinska, M., Howe, B., Kepner, J., Madden, S., Maier, D., Mattson, T., Zdonik, S.: The BigDAWG polystore system. ACM SIGMOD Record 44(2), 11–16 (Aug 2015). https://doi.org/10.1145/2814710.2814713

[9] Haas, L., Kossmann, D., Wimmers, E., Yang, J.: Optimizing queries across diverse data sources. In: Proceedings of the 23rd International Conference on Very Large Data Bases. pp. 276–285. VLDB '97, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (1997), http://dl.acm.org/citation.cfm?id=645923.670995

[10] ISO/IEC: Information technology – Database languages – SQL – Part 9: Management of External Data (SQL/MED). ISO/IEC standard (2016)

[11] Kolev, B., Valduriez, P., Bondiombouy, C., Jiménez-Peris, R., Pau, R., Pereira, J.: CloudMdsQL: Querying heterogeneous cloud data stores with a common language. Distributed and Parallel Databases pp. 1–41 (2015). https://doi.org/10.1007/s10619-015-7185-y


[12] LeFevre, J., Sankaranarayanan, J., Hacigumus, H., Tatemura, J., Polyzotis, N., Carey, M.: MISO: Souping up big data query processing with a multistore system. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. pp. 1591–1602. SIGMOD '14, ACM, New York, NY, USA (2014). https://doi.org/10.1145/2588555.2588568

[13] Lima, A.: OutSystems Platform – architecture and infrastructure overview. Tech. rep., OutSystems (2013)

[14] Melton, J., Michels, J.E., Josifovski, V., Kulkarni, K., Schwarz, P.: SQL/MED: A status report. ACM SIGMOD Record 31(3), 81–89 (Sep 2002). https://doi.org/10.1145/601858.601877

[15] Development and deployment environments. https://www.outsystems.com/evaluation-guide/outsystems-tools-and-components/#1, accessed: 2020-03-02

[16] OutSystems architecture. https://www.outsystems.com/evaluation-guide/outsystems-architecture/, accessed: 2020-03-02

[17] PostgreSQL foreign data wrappers. https://wiki.postgresql.org/wiki/Foreign_data_wrappers, accessed: 2020-02-27

[18] Ruiz, D.S., Morales, S.F., Molina, J.G.: Inferring versioned schemas from NoSQL databases and its applications. In: International Conference on Conceptual Modeling. pp. 467–480. Springer (2015)

[19] Rymer, J.R., Mines, C., Vizgaitis, A., Reese, A.: The Forrester Wave™: Low-code development platforms for AD&D pros, Q4 2017. Forrester Research, Inc (Oct 2017)

[20] Samwel, B., Cieslewicz, J., Handy, B., Govig, J., Venetis, P., Yang, C., Peters, K., Shute, J., Tenedorio, D., Apte, H., Weigel, F., Wilhite, D., Yang, J., Xu, J., Li, J., Yuan, Z., Chasseur, C., Zeng, Q., Rae, I., Biyani, A., Harn, A., Xia, Y., Gubichev, A., El-Helw, A., Erling, O., Yan, Z., Yang, M., Wei, Y., Do, T., Zheng, C., Graefe, G., Sardashti, S., Aly, A.M., Agrawal, D., Gupta, A., Venkataraman, S.: F1 Query: Declarative querying at scale. Proceedings of the VLDB Endowment 11(12), 1835–1848 (Aug 2018). https://doi.org/10.14778/3229863.3229871

[21] Seco, J.C., Lourenço, H., Ferreira, P.: A common data manipulation language for nested data in heterogeneous environments. In: Proceedings of the 15th Symposium on Database Programming Languages. pp. 11–20. DBPL 2015, ACM, New York, NY, USA (2015). https://doi.org/10.1145/2815072.2815074

[22] Spoth, W., Xie, T., Kennedy, O., Yang, Y., Hammerschmidt, B., Liu, Z.H., Gawlick, D.: SchemaDrill: Interactive semi-structured schema design. In: Proceedings of the Workshop on Human-In-the-Loop Data Analytics. p. 11. ACM (2018)

[23] Stonebraker, M.: The case for polystores. ACM SIGMOD Blog (Jul 2015), http://wp.sigmod.org/?p=1629

[24] Transaction Processing Performance Council (TPC): TPC Benchmark C – Standard Specification, revision 5.0 edn. (2001)

[25] Vincent, P., Iijima, K., Driver, M., Wong, J., Natis, Y.: Magic quadrant for enterprise low-code application platforms. Gartner (Aug 2019)
