How Drill Enriches Self-Service Analytics The Added Value of a SQL-on-Everything Engine
A Whitepaper
Rick F. van der Lans Independent Business Intelligence Analyst R20/Consultancy
November 2015

Sponsored by MapR Technologies, Inc.
Copyright © 2015 R20/Consultancy. All rights reserved. Apache Hadoop, HBase, and Drill are trademarks of the Apache Software Foundation and are not affiliated with MapR Technologies, Inc. Trademarks of companies referenced in this document are the sole property of their respective owners.
Table of Contents

1 Introduction
2 The Ever-Changing World of Self-Service Analytics
3 Are SQL-on-Hadoop Engines the Solution?
4 Opening Up Any Data Source for Analysis with Apache Drill
5 The Architecture of Apache Drill
6 Use Cases for Apache Drill
About the Author Rick F. van der Lans
About MapR Technologies, Inc.
1 Introduction

More and more business analysts and data scientists no longer restrict themselves to internally produced data that comes from IT-managed production systems. For their analysis they use all the data they can lay their hands on, and that includes external data sources. Tech-savvy analysts in particular obtain data from the internet (such as research results), access social media data, analyze open and public data, get files with analysis results from colleagues, and so on. They mix this external data with internal data to get the most complete and accurate business insights.

Unfortunately, not all of this external data has a schema and a simple structure. In that case, the analyst can't import the data into his or her favorite analytical tool; the data is out of reach. The analyst must then lean on IT and ask them to import the data into a SQL table. Developing such a program can take IT quite some time, as IT is typically backlogged, which stalls the analysis process considerably, possibly by weeks.

This whitepaper describes Apache Drill. Drill offers SQL access to most of the classic and new data sources, including Hadoop, MongoDB, JSON, cloud storage, and so on. These data sources can be accessed even if no schema for the data exists, even if the data is not flat but hierarchical and contains repeating groups, and even when each record in a table has a somewhat different structure. Drill is an example of a SQL-on-Everything solution. Analysts don't have to ask IT for assistance; they can use Drill against any kind of data source, because Drill discovers what the structure of the data is. This enriches analytical capabilities and improves self-service analytics. The whitepaper describes the following topics:
Why do analysts need access to any type of data?
What do we mean by variable and complex data structures?
Can SQL‐on‐Hadoop solutions help with accessing any data source?
How does Drill make it possible to make any data source available for analytics?
What does the overall architecture of Drill look like?
What are the typical use‐cases for Drill?
2 The Ever-Changing World of Self-Service Analytics
Self-Service Analytics In The Beginning – There was a time when analysts were very restricted when accessing data sources. They could start up their reporting tools and invoke one of the predefined reports, but that was it. Self‐service was restricted to the report they wanted to see at that particular moment in time.
We’ve come a long way since then. Years ago, powerful self-service BI tools, such as Tableau, QlikView, and Spotfire, were introduced. These tools offer analysts total freedom in analyzing data. The financial success of their respective vendors clearly shows their acceptance by the market. Yet even with these powerful tools, analysts remain restricted in their analytical capabilities. They can usually only analyze data that is made available by the IT department and that has been modeled with a well-defined schema. IT usually creates several data marts in which the data that analysts are allowed to use is stored. So, analysis is confined to the data structures, relationships, and aggregation levels defined by IT. If analysts discover a new data source that can be of interest and that has to be integrated with the data in the data mart, they must ask IT to develop the logic to take that data source and integrate it with the existing data. This evidently takes time. In fact, it can even take weeks before the new data source is available for analysis.
Self-Service Analytics Progresses – Lately, self‐service tools have been extended with more advanced features for analysts to easily integrate data sources themselves. These are called data blending or data wrangling features. These features allow analysts to develop the integration logic themselves. So, there is no need to wait for the IT department to assist. Tools such as Alteryx have become quite popular because of their data blending features. The risk of this approach is that analysts make incorrect assumptions when interpreting the data stored in the new data source, and that they, for example, join the new data source on the wrong columns. Data preparation functionality helps with that, for example by analyzing the data beforehand and suggesting to the analysts what the tool thinks is the best way to join the new data source. This type of functionality is a real improvement for the analysts. It allows them to work quickly with new data sources even if they haven’t seen the file before.
Dealing with Variable and Complex Data Structures – But the bar for self‐service analytics keeps being raised. The limitation of most tools is that they can only handle data with a simple data structure; each table has a fixed and flat schema. Fixed means that each record in a table has the same schema, meaning each record has the same set of columns and each column has the same data type. A flat schema means that no record contains hierarchical data structures or repeating data structures. Many data sources support fixed and flat schemas. For example, all the tables stored in SQL databases have fixed and flat data structures. Each record in a SQL table has the same set of columns and columns don’t have some form of hierarchy. The same applies for many CSV and TSV files and classic sequential files. What has changed is that more and more data is stored in data storage technologies that don’t always have flat and fixed schemas. For example, a lot of big data is stored in the form of JSON documents, XML documents, MongoDB databases, Apache HBase databases, and Hadoop files using the Parquet and AVRO file formats. What all these systems have in common is that the data structures don’t have to be fixed
(each record can have a different set of columns and may contain repeating sets of values sometimes called arrays) and don’t have to be flat (they may contain hierarchies). For example, the following JSON example shows three records in some data storage system:
{ "number" : "6",  "name" : "Manzarek", "initials": "R",
  "street" : "Haseltine Lane", "town" : "Phoenix" }
{ "number" : "8",  "name" : "Young",    "initials": "P",
  "street" : "Brownstreet",   "mobile" : "1234567" }
{ "number" : "15", "name" : "Metheny",  "initials": "M",
  "province": "South" }
Each record has a JSON structure. All the records look alike, but they’re all a little different. So, this data source does not have a fixed schema, but a variable schema, sometimes referred to as schema-free or schema-less. Records may also contain repeated sets of values, as the following example shows. Here, each employee can be assigned to a number of projects. Employee 15 is assigned to three projects: ACP3, HHGT, and X456. This is clearly an example of a variable schema.
{ "employee" : { "number" : "15",
                 "projects": [ { "name": "ACP3" },
                               { "name": "HHGT" },
                               { "name": "X456" } ] } }
In the next example, three records are shown in which each one contains two hierarchies. The element name contains the sub‐elements lastname and initials, and the address element contains street, houseno, postcode, and town. This is not a flat schema but a complex schema.
{ "employee" : { "number" : "6",
                 "name"   : { "lastname": "Manzarek", "initials": "R" },
                 "address": { "street"  : "Haseltine Lane", "houseno" : "80",
                              "postcode": "1234KK", "town" : "Stratford" } } }
{ "employee" : { "number" : "8",
                 "name"   : { "lastname": "Young", "initials": "N" },
                 "address": { "street"  : "Brownstreet", "houseno" : "80",
                              "province": "ZH", "town" : "Boston" } } }
{ "employee" : { "number" : "15",
                 "name"   : { "lastname": "Metheny", "initials": "M",
                              "code"    : "45" } } }
Note that in JSON and XML the structure (metadata) of each value is stored together with the data itself. In other words, data and metadata are stored together, which means that the data storage system understands the structure of the data. Many self-service tools are not able to process data with variable and complex data structures. They cannot process the hierarchical structures intelligently, and they would not know how to process tables in which each record can have a (slightly) different set of columns. The consequence is that if analysts want to include this data in their analysis, they must ask IT to transform the data into a form that can be processed by the analytical tool. Informally said, IT is asked to “flatten” the data. This delays analysis significantly. Not because IT is slow, but because it’s a lot of work to take a data source with a variable and complex data structure and turn it into a flat and fixed data structure. Complex ETL programs must be written and tested; see Figure 1.
Figure 1 Complex ETL programs must be developed to flatten the variable and complex data structures.
In addition, every time a new version of the data source becomes available, IT has to check whether new columns have been added in some records. If so, the ETL program must be updated accordingly and the database structure must be extended to make room for these new columns. This is a tedious and maintenance-intensive operation.
3 Are SQL-on-Hadoop Engines the Solution? Many SQL‐on‐Hadoop engines are available for accessing data stored in Hadoop files using the familiar SQL interface. Examples are Apache Hive, Impala, and Pivotal Hawq. Because most of the new data is
being stored in Hadoop files, it’s worth investigating if these engines allow access to complex and variable data structures.
SQL-on-Hadoop and Non-Hadoop Data – Being able to access data stored in Hadoop files is of great value to many analysts, because, as indicated, so much new data is being stored in Hadoop. In the early days of Hadoop, files could only be accessed through interfaces such as MapReduce, HBase, and Pig. Nowadays, SQL-on-Hadoop engines exist to make access to Hadoop files easier. Having a SQL interface to all the Hadoop files really opens up that data to an enormous number of reporting and analytics tools and an enormous number of users. So, SQL-on-Hadoop engines help with the problem described in the previous section. However, they only offer a limited solution. Allowing easy access to data in Hadoop is very valuable, but there are other data sources that should be taken into consideration as well. For example, massive transactional systems have been developed with MongoDB that contain very valuable data for analytics, and large quantities of open data are stored in files using JSON data structures or simple comma-separated file structures. And don’t underestimate the amount of valuable data stored in Excel spreadsheets. If a SQL-on-Hadoop engine doesn’t support access to data stored outside Hadoop, then that data source must be copied to Hadoop before it can be analyzed. This copying takes time and requires the development of dedicated ETL-like code.
SQL-on-Hadoop and Complex and Variable Data Structures – An additional limitation of several SQL-on-Hadoop engines is that they can only query Hadoop files that contain fixed and flat data structures. Unfortunately, not all data stored in Hadoop is fixed and flat. For example, if the Parquet or AVRO file formats are used, the data can have a hierarchical structure. Again, in such a situation the data must be flattened and copied to another Hadoop file before it can be queried with a SQL-on-Hadoop engine. So be careful when vendors claim their SQL engines can access Hadoop files. They probably can, but that doesn’t mean they can intelligently access all the files: restrictions apply.
4 Opening Up Any Data Source for Analysis with Apache Drill A special SQL‐on‐Hadoop engine is Apache Drill. The current version is 1.2. From day one, Drill has been designed and optimized to access any type of data source, among which are Hadoop and JSON files, and to be able to work with complex and variable data structures. It has been designed for users who need to analyze data sources quickly and don’t have the time to wait for IT assistance. This section describes some of the distinguishing features of Drill.
Accessing Any Data Source – Drill can access any of the following data sources:
CSV and TSV files
Hadoop files with Parquet and AVRO file formats
MongoDB databases
Apache HBase databases
SQL database servers through an ODBC/JDBC interface
Files with JSON data structures
SQL-on-Hadoop engines such as Apache Hive and Impala

Besides supporting access to all these data sources, users can write joins to integrate data from all of them. In conclusion, there is rarely ever a need to copy data from a data source to Hadoop to make it accessible to Drill.
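For instance, a single Drill query could join a Hive table with a MongoDB collection. The following sketch is hypothetical: the schema names depend on how the storage plugins have been configured, and the table and column names are made up for illustration.

```sql
-- Hypothetical cross-source join: a Hive table joined with a MongoDB collection.
-- Schema, table, and column names depend on the configured storage plugins.
SELECT o.order_id, o.amount, c.name
FROM   hive.orders o
JOIN   mongo.sales.customers c
  ON   o.customer_id = c.customer_id;
```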
No Need to Analyze the Schema – Most SQL database servers and SQL-on-Hadoop engines need access to the schema definition of the data source that’s being accessed. For example, the Oracle database server interrogates its built-in catalog before a query is processed. It needs to know the names of the columns, the available keys, the data types, statistical data on the cardinality of tables and the distribution of values in columns, and so on. Only then can the query be compiled into an execution plan to access the data. This means that the structure of the query result is known even before one record is retrieved from the database. Drill doesn’t need access to the schema definition of the data source it’s accessing. It doesn’t need to know the structure of the tables, nor does it need statistical data. It goes straight against the data. The schema of the query result is therefore not known in advance; it’s built up and derived as data comes back from the data source. During the processing of the data, the schema of the query result is continuously adapted. So, the schema of a Hadoop file (or any data source) doesn’t have to be documented in Apache Hive to make it accessible for Drill. For example, the three records shown in Section 2 can be accessed with the following Drill query. Nowhere is the schema of this file defined.
SELECT * FROM dfs.`example3.json`;
The result:
+---------+-----------+-----------+-----------------+----------+----------+-----------+
| number  | name      | initials  | street          | town     | mobile   | province  |
+---------+-----------+-----------+-----------------+----------+----------+-----------+
| 6       | Manzarek  | R         | Haseltine Lane  | Phoenix  | null     | null      |
| 8       | Young     | P         | Brownstreet     | null     | 1234567  | null      |
| 15      | Metheny   | M         | null            | null     | null     | South     |
+---------+-----------+-----------+-----------------+----------+----------+-----------+
The fact that the second and third records have fewer columns is no problem for Drill.
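Hierarchical data can be queried just as directly, using dot notation to navigate into nested elements. The file name in the following sketch is hypothetical; the structure matches the employee records shown in Section 2.

```sql
-- Hypothetical query on the hierarchical employee records from Section 2.
-- Dot notation navigates into the nested name and address elements.
SELECT t.employee.`number`       AS empno,
       t.employee.name.lastname  AS lastname,
       t.employee.address.town   AS town
FROM   dfs.`example5.json` AS t;
```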
No Need to Import Data – Most SQL database servers can only process data stored in their own database. Data from other sources must be imported first. Importing big data can be very time-consuming and expensive. Imagine a data source where thousands of new records with sensor data are added every second. Such files can become massive and the copying process very slow. Big data is sometimes too big to copy! Some SQL database servers have features for accessing external tables. For example, PostgreSQL supports the concept of foreign tables. Still, the schema of the foreign table must be imported into PostgreSQL first. Most SQL-on-Hadoop engines have a comparable limitation: data must be copied to Hadoop files first before it can be queried. And sometimes data must be copied from one Hadoop file to another just to get it into the right file format supported by the SQL-on-Hadoop engine and to flatten the data. For Drill there is no reason to import data. In fact, Drill doesn’t have its own data storage engine; it always uses the engine that belongs to the data source. There is no need to copy the data, making it readily available for analysis. This perfectly fits the needs of self-service users.
No Need to Define a Flat and Fixed Schema – Initially, Drill was designed to support SQL on data sources using JSON data structures. This implies that Drill was designed to handle arrays, repeating groups, hierarchical structures, and so on. In fact, if a SQL implementation can handle every JSON data structure, it can handle any kind of data structure. If it can handle complex data structures, it can definitely handle flat data structures, because flat data structures are the simplest form of complex data structures. Likewise, if a product can handle variable schemas, it can definitely handle fixed schemas, because they are the simplest form of variable schemas. By implementing support for JSON, a superset of all the structures that can be found in data sources was implemented. In conclusion, because Drill was designed to handle complex and variable data structures, data doesn’t have to be transformed into flat data structures; in other words, data doesn’t need to be flattened beforehand. Data sources with complex and variable data structures have a tendency to change regularly. Quite often new records with a slightly different structure are added (new columns or elements have been added). A key advantage of Drill is that it has no problems with these changing data structures. If this happens, for Drill it’s business as usual: it discovers on the fly that there is a new column and responds properly. There is no need to extend a flat schema and adapt an ETL program, whereas this is needed for most other SQL-on-Hadoop engines. Drill offers a more flexible solution for handling variable and complex data, because it’s able to access data directly and because it does not have to copy the data.
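Repeating groups can even be unnested on the fly. Drill’s FLATTEN function produces one output record per element of an array, as this sketch illustrates; the file name is hypothetical and the structure matches the projects example in Section 2.

```sql
-- FLATTEN turns each element of a repeating group into a record of its own.
-- The file name is hypothetical; the structure matches the projects example.
SELECT t.employee.`number`           AS empno,
       FLATTEN(t.employee.projects)  AS project
FROM   dfs.`example4.json` AS t;
```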
No Need to Optimize the Physical Database Design – The performance of queries on many SQL databases is heavily determined by the physical design of the tables. Have the right indexes been defined, have the tables been partitioned properly, and have all the tablespace parameters been set optimally? If not, performance can suffer dramatically. Before data is made available for analytics, IT commonly spends quite some time on getting the physical database design right. For Drill there is no need to spend time in advance optimizing the physical database design. Drill operates on the data source directly, pulls data into memory, and uses its own optimization technology to speed up queries.
SQL-on-Everything – Everyone categorizes Drill as a SQL-on-Hadoop engine. In a way this makes sense, because Drill does allow access to data stored in Hadoop. However, Drill is, as described extensively in this whitepaper, not restricted to accessing data stored in Hadoop. Therefore, it’s time for a new category of SQL engines: SQL-on-Everything. Drill definitely classifies as a SQL-on-Everything engine.
5 The Architecture of Apache Drill This section describes Drill’s support for SQL, its internal architecture, and the performance optimization techniques to speed up query processing.
Drill’s Support for SQL – On the outside, Drill looks like any other SQL product. Therefore, through its ODBC/JDBC interfaces, it can be accessed by almost any tool for reporting and analytics; see Figure 2. So, every user of, for example, Tableau, QlikView, and Spotfire can analyze data stored in a wide range of data sources without having to rely on the IT department to import that data and transform it to a flat data structure. This really improves the self-service level and allows these analysts to analyze data sources much faster.
Figure 2 Most tools for reporting and analytics can use Apache Drill to access a wide range of data sources.
To be able to support a wide range of queries, Drill supports ANSI SQL. It includes all the standard features, such as inner joins, left and right outer joins, aggregations (group by), statistical functions, window functions, correlated subqueries, and common table expressions. This extensive SQL dialect of Drill is important, because more and more analytical tools expect SQL engines to support all the advanced query capabilities. If Drill could not do that, many queries would not run on Drill. Drill also supports user-defined functions. Such functions can be used for analytical operations, but also for transforming schema-less data into schema-rich data. This makes schema-less data available to more classic reporting and analytical tools. The processing of UDFs can be distributed over many nodes.
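As a sketch of what this dialect makes possible, the following hypothetical query combines a common table expression with a window function; the file and column names are made up for illustration.

```sql
-- Hypothetical example combining a common table expression with a window function.
WITH regional_sales AS (
  SELECT region, rep, SUM(amount) AS total
  FROM   dfs.`sales.json`
  GROUP  BY region, rep
)
SELECT region, rep, total,
       RANK() OVER (PARTITION BY region ORDER BY total DESC) AS rank_in_region
FROM   regional_sales;
```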
Although Drill doesn’t require a metadata store, in cases where pre-defined schemas already exist, Drill can use the same metadata store as other SQL-on-Hadoop engines, such as Hive and Impala. So, table definitions entered by, for example, Hive can be read by Drill, and vice versa. This metadata store is accessible through the HCatalog interface.
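A table registered in that metadata store can then be queried from Drill simply by qualifying it with the hive schema; the table name below is hypothetical.

```sql
-- Hypothetical query against a table whose definition lives in the Hive metastore.
SELECT * FROM hive.orders LIMIT 10;
```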
Pushdown of SQL Operations – When Drill accesses a simple file, Drill itself must process all the operations specified in the SQL query, because a file system is not capable of processing any operation; it can only retrieve records from the file. Therefore, Drill retrieves all the data from the file and processes all the operations specified in the query. Some data sources do support their own query processing capabilities, such as SQL-on-Hadoop engines, SQL database servers, and some NoSQL database servers. In such situations, Drill tries to push down as much of the query processing as possible to those engines. For example, if the incoming query contains a filter, Drill pushes the filter operation to the data source, so that only the records adhering to the filter are returned for processing by Drill, instead of all the records being returned and Drill having to process the filter itself. This delegation of processing minimizes I/O, resource utilization, and network traffic. Especially on big data sources, pushdown is an important performance optimization technique.
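Whether a filter has actually been pushed down can be verified with Drill’s EXPLAIN statement, which shows the execution plan; the table name below is hypothetical.

```sql
-- EXPLAIN shows the execution plan; a pushed-down filter appears inside the
-- scan operator of the plan rather than in a separate filter operator.
EXPLAIN PLAN FOR
SELECT * FROM mongo.sales.customers WHERE region = 'EMEA';
```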
Dedicated Plugins – Drill supports a wide range of file and database plugins. For example, file plugins are available for CSV, AVRO, Parquet, JSON, HDFS, Amazon S3, and NFS. And database plugins exist for MongoDB, Apache HBase, and Apache Hive. Each plugin has been optimized for these data sources. Drill comes with a predefined set of plugins, but organizations can develop their own plugins for special data sources for which no plugin is available yet.
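A storage plugin is configured with a small JSON document, typically through Drill’s web console. As a minimal sketch, a MongoDB plugin could look like this; the connection string is hypothetical.

```json
{
  "type": "mongo",
  "connection": "mongodb://localhost:27017/",
  "enabled": true
}
```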
Data-Driven Query Compilation and Recompilation – In a classic SQL environment, when a query is entered, all the required schema information is obtained, statistical data is retrieved, the existence of indexes is determined, and finally an execution plan is developed. Statistical data, such as table cardinality, is used to predict what the best execution plan is. This execution plan describes exactly how the query is going to be processed: which indexes are used, how tables are joined together, the number of columns in the query result, the data types of each column, and so on. But it’s a fixed plan. Once the execution plan has started, there is no turning back. Even when the processing takes much longer than expected, no switch is made to an alternative execution plan. Query optimizers exist that change execution plans when they discover that the assumptions they made, based on the distribution of column values and the cardinalities of tables, are wrong. These optimizers change their execution plan on the spot. Still, the structure of the query result won’t change. Drill can’t make any assumptions, neither with respect to the data structure nor with respect to the statistical data. In fact, that data is just not available for many data sources. Plus, when it starts to read a file, the data structure can keep changing due to complex and variable data structures. Depending on the data structures it finds while reading the data, Drill changes the execution plan on the spot by recompiling it. For this, Drill supports several techniques. Being able to recompile queries is probably Drill’s biggest claim to fame.
Columnar Execution On Complex and Variable Data Structures – Most SQL products build up temporary results in memory in a record‐oriented fashion when they execute queries; see Figure 3. This makes sense because data is also stored in a record‐oriented fashion, and each record retrieved has the same set of columns. Record‐oriented storage fits well with transaction‐oriented workloads and with reports that access most of the columns of a table.
Figure 3 Most SQL products keep data in memory using a record-oriented structure.

Column-oriented storage and column-oriented build-up of data in memory, on the other hand, are more efficient for reporting and analytics, where only a limited number of columns is needed to create a result. Because Drill is designed to support analytics, it uses a column-oriented structure when processing data in memory; see Figure 4. Besides being efficient for Drill’s own analytical workload, this structure also works very well for accessing tables with variable sets of columns. When Drill starts to retrieve data from a data source, the temporary result it builds up in memory may consist of X columns. For each column a memory area is reserved; in a way, these in-memory columns are all independent of each other. When Drill suddenly encounters a record with one extra column, the only thing it must do is build up an extra in-memory column. All the results created so far stay unchanged. This would be much more difficult if Drill kept data in memory in a record-oriented fashion, because all the records already processed and stored in memory would have to be extended with the extra column. That sounds like a simple operation, but it can be quite resource-intensive and time-consuming. Therefore, Drill uses a columnar structure for data kept in memory.
A Distributed Query Processing Architecture – Drill comes with its own processing architecture, based on a set of cooperating modules called drillbits. Drillbits are responsible for executing SQL statements. A drillbit is installed on each node that holds data, and each drillbit is capable of executing SQL queries on the data that it manages. If data is stored across many nodes, all relevant drillbits are involved in the processing of the query, thus parallelizing its execution. There is no master-slave architecture in Drill. When applications access Drill, they are “connected” to different drillbits to avoid one drillbit becoming responsible for the management of all the queries; such a drillbit would become a bottleneck when many queries are executed by many applications. Query processing is distributed over as many drillbits as possible, always ensuring data locality.
Figure 4 Drill keeps data in memory using a column-oriented structure.
6 Use Cases for Apache Drill Drill can be used in a wide range of use cases. This section describes a few.
Making Hadoop Data Easy to Analyze – The Hadoop stack supports a large set of modules for analyzing data. Most of them, such as MapReduce, Spark, and Pig, are too complex for business analysts, especially for those that are not too tech‐savvy. Drill helps make access to Hadoop data easy and makes it possible for every analyst to use their favorite analytical tool through Drill’s ODBC/JDBC interfaces.
Making the Data Lake Available for Analytics – Organizations are developing data lakes in which all the data useful for analytics and reporting is stored. It’s their central data inventory. The general recommendation is to use Hadoop for implementing the data lake. Some of the data in data lakes comes from production systems, data warehouses, and data marts, and is well‐structured and (probably) flat and fixed. In addition, schemas are available. But data lakes also contain files that are not managed by IT and for which no schema is available. Drill can make all this data easily available for analytics for almost every user. Drill opens up the flood gates to the data lake.
Making Non-Relational Data Available for Analytics – There is a wealth of information available in all kinds of files scattered around the organization. These files were generated with spreadsheets and other tools. They could contain results of manual studies by colleagues who summarized financial data on competitors from newspapers, market share data gathered over a long period of time copied from several sources, and results acquired from a study done by a company such as Dun and Bradstreet. These files are simple CSV files or JSON documents. Drill can turn these simple files into data sources that can be analyzed from all possible angles.
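Even a headerless CSV file can be queried directly; Drill exposes the fields of such a file through a generated columns array that is addressed positionally. The file name and column positions below are hypothetical.

```sql
-- Fields of a headerless CSV file are addressed positionally via the columns array.
SELECT columns[0] AS competitor,
       columns[1] AS market_share
FROM   dfs.`/data/competitors.csv`;
```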
Integrating Data Marts with Other Data Sources – Users want to combine data stored in the data sources mentioned in this section with the more IT‐controlled data marts and data warehouses developed with
SQL database servers, such as Oracle, SQL Server, and Teradata. Drill supports access to these SQL databases through ODBC and JDBC, in addition to access to the other data sources. Furthermore, it allows the two types of data sources to be joined seamlessly, as if they form one logical database.
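A federated query of this kind can be written as ordinary SQL. In the sketch below, a data-mart table is assumed to be exposed through a storage plugin named `mart`; the plugin name, schema, table, and file path are all illustrative assumptions, not real objects.

```python
# Sketch of a federated Drill query joining a data-mart table (exposed
# through a storage plugin assumed here to be named `mart`) with a raw
# JSON file in the lake. All names and paths are illustrative.
federated_sql = """
SELECT c.customer_id, c.region, s.total_spend
FROM mart.sales.customers AS c
JOIN dfs.`/landing/spend_summary.json` AS s
  ON c.customer_id = s.customer_id
""".strip()

# To the analyst this reads as one logical database: Drill plans the
# query across both sources and joins the intermediate results itself.
```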
Making Open Data Available for Analytics – Every day, more files with open data become publicly available. Examples are data sets containing weather data, medical data, pollution data, socio-demographic data, crime data, airport data, vehicle collision data, and so on. Governments in particular are producing mountains of valuable data. At the time of writing, 189,920 data sets with US government open data are available at www.data.gov. Most of this data does not come with a predefined schema and is normally stored as JSON or XML. Drill is perfect for making this data available for SQL querying.
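Open-data JSON files are frequently nested and contain repeating groups. Drill's FLATTEN function turns a repeated element into one row per item, so conventional SQL tools can consume the result. The data-set path and field names below are illustrative assumptions in the style of a data.gov file.

```python
# Sketch: unnesting a repeated group in a JSON open-data file with
# Drill's FLATTEN function. Path and field names are illustrative.
nested_sql = """
SELECT t.station, t.reading.`value` AS pollution_level
FROM (
  SELECT station, FLATTEN(readings) AS reading
  FROM dfs.`/open_data/air_quality.json`
) AS t
""".strip()
```

The inner SELECT produces one row per element of the `readings` array; the outer SELECT then drills into each element as an ordinary column.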
About the Author Rick F. van der Lans

Rick F. van der Lans is an independent analyst, consultant, author, and lecturer specializing in data warehousing, business intelligence, big data, database technology, and data virtualization. He works for R20/Consultancy (www.r20.nl), a consultancy company he founded in 1987.

Rick is chairman of the annual European Data Warehouse and Business Intelligence Conference (organized in London). He writes for Techtarget.com1, B-eye-Network.com2, and other websites. He introduced the business intelligence architecture called the Data Delivery Platform in 2009 in a number of articles3, all published at B-eye-Network.com.

He has written several books on SQL: Introduction to SQL (fourth edition), SQL for MySQL Developers, The SQL Guide to SQLite, The SQL Guide to Ingres, The SQL Guide to Pervasive PSQL, and The SQL Guide to Oracle. Published in 1987, his popular Introduction to SQL4 was the first English book on the market devoted entirely to SQL. After more than twenty years, this book is still being sold and has been translated into several languages, including Chinese, German, and Italian. His latest book5, Data Virtualization for Business Intelligence Systems, was published in 2012.

For more information please visit www.r20.nl, or email [email protected]. You can also get in touch with him via LinkedIn and via Twitter @Rick_vanderlans.
About MapR Technologies, Inc.

MapR provides the industry's only big data platform that combines the processing power of the top-ranked Hadoop with web-scale enterprise storage and real-time database capabilities, enabling customers to harness the enormous power of their data. Organizations with the most demanding production needs, including sub-second response for fraud prevention, secure and highly available data-driven insights for better healthcare, petabyte analysis for threat detection, and integrated operational and analytic processing for improved customer experiences, run on MapR. A majority of customers achieve payback in fewer than 12 months and realize greater than 5X ROI. MapR ensures customer success through world-class professional services and with free on-demand training that 40,000 developers, data analysts and administrators have used to close the big data skills gap. Amazon, Cisco, Google, HP, SAP, and Teradata are part of the worldwide MapR partner ecosystem. Investors include Google Capital, Lightspeed Venture Partners, Mayfield Fund, NEA, Qualcomm Ventures and Redpoint Ventures. Connect with MapR on Twitter, LinkedIn, and Facebook.
1 See http://www.techtarget.com/contributor/Rick-Van-Der-Lans
2 See http://www.b-eye-network.com/channels/5087/articles/
3 See http://www.b-eye-network.com/channels/5087/view/12495
4 R.F. van der Lans, Introduction to SQL; Mastering the Relational Database Language, fourth edition, Addison-Wesley, 2007.
5 R.F. van der Lans, Data Virtualization for Business Intelligence Systems, Morgan Kaufmann Publishers, 2012.