The Bloor Group - The Internet Of Things Report | IOT · 2016-02-01 · A Database Platform for the...


The Bloor Group

WHITE PAPER

A Database Platform for the Internet of Things

ParStream “Walks the Walk”

Robin Bloor, Ph.D. & Rebecca Jozwiak


The Once and Future Internet of Things

We think of the Internet of Things (IOT) as new and dramatic. Its impact will no doubt soon be dramatic, but it is not as new as many people suppose. In respect to consumer-facing technology, we can trace the IOT back to the invention of ATMs in the late 1960s, nearly 50 years ago. True, we are pushing the definition of IOT here. The first ATMs were not connected, and their connected use only became widespread in the 1980s – and even then, although the ATM was a connected device, it initially only qualified as an Intranet of Things.

Other such early applications included control systems in semi-automated factories. Sensors were placed at key points on production lines and would monitor activity in order to optimize the speed of the production line. These were the first real-time systems, and they were also implemented in chemical plants and oil refineries. RFID tags also emerged sooner than one might imagine, invented in 1973 and deployed in limited applications soon afterwards. They too became part of the early IOT.

A Trail of Disruption

The history of computing is punctuated by periods of disruptive innovation that play havoc with the technology landscape. First came the PC, distributing computer power to everyone within, and eventually outside, the corporation. Then came the Internet, which networked what had become hundreds of millions of PCs to what eventually became hundreds of millions of servers that furnished them with Web content and services. Next came the march of mobile technology, extending applications to anyone almost anywhere at almost any time. The IOT is the fourth and probably final step in this proliferation of technology. Now technology will be distributed into places where only chips and sensors can go, giving us eyes, ears and the sense of touch in every location we might want to place it.

Each one of these innovations brought disruptive changes to software and database technology. The client/server nature of the PC revolution gave us the relational database which, with its standard SQL access, was almost tailor-made for server operation. The Internet remade the user interface in its own browser image and proliferated huge volumes of unstructured data, introducing content management systems to supplement the weaknesses of relational databases for content data. The mobile revolution ramped up the data volumes again, introducing the relatively new dimension of geographical location and amplifying the importance of the dimension of time. On one hand, it provoked a merging of two- and three-dimensional graphical data with more structured data, and on the other, it proliferated a whole series of social network connections. The IOT is equally disruptive. It will drive up data volumes like never before, introduce new and compelling streaming applications and spawn a wide range of new analytical applications.

New Species of Database

For a while database technology seemed immune to this kind of disruption, but by about 2004 it began to stumble – long before the explosion of smart phones and tablets. CPUs became multicore, enabling new approaches to scaling out databases across large grids of servers. At the same time, the Internet was producing larger collections of data than had ever been amassed before. It began, of course, with Yahoo! and Google but quickly blossomed out to what we now call social media (LinkedIn, MySpace, Facebook, etc.) and the rapidly expanding industry of multiplayer games, where data on the activity of millions of individual game players was being collected and analyzed. The traditional databases were no longer so widely applicable, and in some areas they were utterly inadequate.


Database Type      | Typical Workload                                  | Example Products
-------------------|---------------------------------------------------|---------------------------------------------------------
Traditional RDBMS  | OLTP, data mart and data warehouse queries        | Oracle, Microsoft SQL Server, IBM DB2, MySQL, PostgreSQL
Analytical DBMS    | Analytical queries on very large data volumes     | ParStream, Vertica, Sybase IQ
High Volume OLTP   | Very high volumes of transactions                 | Aerospike, NuoDB, VoltDB
In-memory DBMS     | OLTP and queries on relatively small data volumes | HANA, Kognitio, Altibase
NoSQL              | Document/object storage and retrieval             | MongoDB, CouchDB, MarkLogic
Graph or RDF DBMS  | Graphical/relationship queries                    | Neo4j, SPARQLverse, Stardog
Hadoop (HDFS) DBMS | Primarily ETL, data cleansing & some analytics    | HBase, Impala, Splice Machine

Table 1. Different Database Types

The above table provides a summary of the database landscape that has emerged. The traditional databases, including open source ones like MySQL and PostgreSQL, are, of course, suited to the traditional OLTP, data mart and data warehouse workloads. They run into trouble when data volumes become very large, speed of ingest is very high or the response time needs to be extremely fast. When traditional databases attempt high volumes of both transactions and queries that require locks, the system will slow significantly.

The analytical databases like ParStream and Vertica are built for very high volumes of data coupled with the need to manage high volumes of queries. The applications for such databases are mainly BI and analytics, similar to the traditional data warehouse. They are fast in part because they are not used for transactions and therefore don’t need locks.

There are also databases like Aerospike and VoltDB that specialize in extremely high volumes of OLTP transactions. This category is very close, but not identical, to the in-memory databases like SAP’s HANA or Kognitio, which simply focus on speed and response time. These can be useful to companies that wish to accelerate the speed of traditional business applications. The next two database categories move beyond the realm of the traditional RDBMS, providing for query workloads that do not work well on the indexed tables of an RDBMS. The NoSQL databases like MongoDB and CouchDB are built both to scale out and to manage nested data structures that are typical of documents and Web page content. The graph or RDF databases like Neo4j and Stardog are built to manage queries that trace out networks of relationships (who is connected to whom, or what is connected to what, and how).

The final category consists of databases whose common characteristic is that they are built to run on Hadoop’s HDFS file system. Currently, all of them seem to target the workloads of the traditional RDBMS, but with the nuance that they scale out better. They are unlikely to ever challenge either the analytical or high volume OLTP databases in respect to scale and capability, but as they mature, they may become attractive alternatives to the traditional RDBMS.


For businesses that are selecting database products for a specific type of application, our advice is to determine which category of database they need before considering which products to investigate. While we can expect some rationalization among these database categories as time passes, we expect most of them to persist, with two or three products dominating each category. This is because the categories are derived from different types of workload, and we do not expect a database engine that is excellent in one of these categories to perform particularly well in the others.

The Unique Challenge of the Internet of Things

Clearly, the majority of IOT systems will need analytical databases, because sensors and embedded processors will generate very large volumes of data. There will be two common types of application that run against such data, one which involves an urgent, near real-time reaction to data and one which involves a less urgent and more considered response:

• Urgent Response Applications: These applications may involve either sophisticated or relatively simple aggregations and calculations. Urgent action will be triggered when specific thresholds are passed. Think, for example, of an electricity grid where demand is escalating and reaches a point where new sources of power need to quickly be brought into service.

• Considered Response Applications: These are more traditional analytics and BI applications that deduce valuable information from the data and pass such information to users to use in decision making.

The immediate challenge for a database which is employed for either type of application is to be able to deliver fast response times for very large volumes of data and, in particular, very large tables, with perhaps billions of rows. In general the vast volumes of IOT data are event data describing millions of events. The event records will likely have a relatively small number of attributes, and thus the database needs to be able to scan very large tables quickly and ingest data very quickly. The speed of ingest will be notably important for urgent response applications. Swift and effective data compression on ingest will also be essential.

While it may seem natural to think of an IOT database as being a centralized mountain of data accumulated from thousands of distributed sources, a centralized approach will not always be a practical way to process data for urgent response applications, which need to recognize trigger situations and respond to them immediately. With such an approach, moving extremely large amounts of data over the network can cause serious performance issues. While a central hub might not be overloaded by the data traffic, its ingest activity would slow it down and introduce an unacceptable lag into recognizing situations that demand an immediate response. A far better processing response can be assured if you move the processing to the data, rather than move the data to the processing. This, in fact, is an option that ParStream offers for such applications. It leaves the data where it is and distributes the processing.

The Nature of the ParStream Database

ParStream is a massively parallel (MPP) analytical database that has been engineered to provide extremely fast response times on very large volumes of data. Aside from the technological features one would expect from such a product – shared-nothing architecture, column-store data organization, intelligent compression, low-latency, high-speed ingest and efficient data partitioning – ParStream has a distinct indexing capability that sets it apart from the competition.


Figure 1: ParStream’s Distributed Architecture for IOT

ParStream uses patented, bitmapped indexing that embodies a unique way to compress the bitmaps. This delivers a performance boost that takes it beyond the level of competitor products, making it, as far as we are aware, the best performing database in the high-volume analytical database category.
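To make the idea of bitmapped indexing concrete, here is a minimal sketch of a generic, textbook-style bitmap index with simple run-length encoding. It is not ParStream's patented scheme, whose details are not described in this paper; the table contents and the function names (`build_bitmap_index`, `rows_of`, `run_length_encode`) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when row i holds it."""
    index = defaultdict(int)
    for row, value in enumerate(values):
        index[value] |= 1 << row
    return dict(index)


def rows_of(bitmap):
    """Return the row numbers whose bits are set in the bitmap."""
    rows = []
    row = 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows


def run_length_encode(bitmap, n_rows):
    """Encode the bitmap as (bit, run_length) pairs, lowest row first."""
    runs = []
    for row in range(n_rows):
        bit = (bitmap >> row) & 1
        if runs and runs[-1][0] == bit:
            runs[-1] = (bit, runs[-1][1] + 1)
        else:
            runs.append((bit, 1))
    return runs


# Hypothetical event table: one status and one site per sensor reading.
status = ["ok", "ok", "ok", "alarm", "ok", "ok", "alarm", "ok"]
site = ["north", "north", "south", "south", "south", "north", "north", "north"]

status_index = build_bitmap_index(status)
site_index = build_bitmap_index(site)

# WHERE status = 'alarm' AND site = 'north' becomes a single bitwise AND.
matches = rows_of(status_index["alarm"] & site_index["north"])

# Long runs of identical bits are what makes bitmaps compress well.
ok_runs = run_length_encode(status_index["ok"], len(status))
```

The sketch shows why bitmap indexes suit event data with few attributes: a multi-column filter reduces to bitwise operations over compact bit vectors, and low-cardinality columns produce long runs that compress well.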

In addition to being a powerful analytical database, ParStream is remarkably well-suited to IOT implementations, as is illustrated in Figure 1. It is important to understand that this federated mode of operation is not significantly different from how ParStream operates when implemented on a single cluster.

So, working down from the top of the diagram, a variety of analytical apps with many users connect to ParStream. Each SQL query received is parsed and, because the database knows where the data is located, the query is broken down into a series of smaller SQL queries for each of the nodes. Each of these is taken to the appropriate instance of ParStream for execution. The central data (consisting of small tables to which the federated data may be joined) will also be distributed to the federated nodes for executing joins. ParStream replicates such data and keeps the replicas current if the data is updated. Once a query is resolved at the nodes, the partial answers are passed back to the geo-distributed analytics, where they are aggregated.
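The query flow described above can be sketched as a scatter-gather aggregation. This is a simplified illustration under stated assumptions, not ParStream's implementation: the node names, the rows and the functions `local_partial` and `merge_partials` are hypothetical. Each node reduces its own rows to a partial result, and only those partials are combined centrally.

```python
from collections import defaultdict

# Hypothetical federated nodes, each holding only its own readings
# as (sensor_type, value) rows. Names and numbers are illustrative.
NODES = {
    "plant_a": [("temp", 70.0), ("temp", 74.0), ("pressure", 2.0)],
    "plant_b": [("temp", 90.0), ("pressure", 4.0), ("pressure", 6.0)],
}


def local_partial(rows):
    """Runs at a node: reduce local rows to (sum, count) per group."""
    partial = defaultdict(lambda: [0.0, 0])
    for group, value in rows:
        partial[group][0] += value
        partial[group][1] += 1
    return {group: tuple(acc) for group, acc in partial.items()}


def merge_partials(partials):
    """Runs centrally: combine the per-node partials into global averages."""
    totals = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for group, (total, count) in partial.items():
            totals[group][0] += total
            totals[group][1] += count
    return {group: total / count for group, (total, count) in totals.items()}


# Only the small partial results cross the network, not the raw rows.
partials = [local_partial(rows) for rows in NODES.values()]
averages = merge_partials(partials)
```

Note the design choice: nodes return sums and counts rather than local averages, because an average of averages would be wrong whenever the nodes hold different numbers of rows.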

ParStream will also federate rules that reside at every location, as illustrated. Each node continuously ingests data, and thus rules may be fired off when, for example, specific threshold values are exceeded. If a rule applies across all data, it operates almost like a query, with aggregated values being passed to the central node and totals for all data being calculated. Warnings or alerts are issued (to users or applications) when the rule is triggered.
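A rule that applies across all data might look roughly like the following sketch, which borrows the electricity grid example used earlier in this paper. The substation names, readings, threshold and function names are all hypothetical, and the code illustrates only the general pattern (local aggregation, central total, threshold test), not ParStream's actual rule mechanism.

```python
def local_demand(readings_mw):
    """Runs at a node: aggregate the readings ingested locally."""
    return sum(readings_mw)


def evaluate_rule(node_aggregates, threshold_mw):
    """Runs centrally: total the node aggregates and test the rule."""
    total = sum(node_aggregates.values())
    alert = None
    if total > threshold_mw:
        alert = f"ALERT: demand {total} MW exceeds {threshold_mw} MW"
    return total, alert


# Hypothetical substations and readings in megawatts.
readings = {
    "substation_1": [120, 135],
    "substation_2": [210, 190],
    "substation_3": [80, 95],
}

aggregates = {node: local_demand(values) for node, values in readings.items()}
total, alert = evaluate_rule(aggregates, threshold_mw=800)
```

In this sketch each node sends a single number to the center, so the rule can be evaluated frequently without moving raw event data over the network.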

Aside from its ability to federate, ParStream fits this IOT environment well because of its speed and its small footprint at each node (the executable is about 50 megabytes). This means that the remote nodes need few resources (CPU, memory and storage) even when resilience is configured in. For example, a typical mobile device would be more than adequate as a server. This in turn provides a high level of flexibility on how an IOT system may be designed and deployed.

The point to note here is that IOT scenarios can vary considerably. In some situations, such as instrumenting a municipal traffic system, you might implement everything from scratch, deploying both sensors and new data servers. In other situations, a retail outlet for example, the data gathering devices and even a local server may already be in place and you simply need to implement software and perhaps more resilient networking. To make matters more complex, both the volumes of event data produced from various sources and the query workloads that a business might need to satisfy can also vary dramatically.

Aside from supporting federated operations, which is a must, the requirements for an IOT database are flexibility of deployment, the ability to run on private/public cloud infrastructures, efficiency of resource usage and speed of operation. This is what ParStream delivers.

ParStream In Operation

Currently most ParStream implementations serve typical analytical database applications. For example, MPREIS, a large supermarket chain headquartered in Austria, uses ParStream to analyze store purchases at the line item level, so that MPREIS can know profitability at a granular level as well as at a shopping basket level in its more than 200 outlets. The company was using QlikView to provide this analysis, but the application was limited to a data volume of 400 million records, representing just two weeks of data. The initial implementation of ParStream enabled that limit to be increased to 6 months of data, and the intention now is to extend the analysis to 2.5 years of data, amounting to 30 billion records.

In another analytical application, ParStream was deployed at MetaGenoPolis, which is funded by INRA (the French Institut National de la Recherche Agronomique). The goal of the project is to establish the impact of the human gut microbiota on health and disease. Consequently the activity involves analysis of large numbers of gut bacteria samples in order to establish correlations with other clinical tests, in respect to both nutritional and health impacts. In the lab, about 50 million bacteria are identified per sample, so the analytical workload is huge. ParStream improved the speed at which genomic indicators were discovered by a factor of 100 over the database technology that was previously used. The workload of this application is expected to grow well beyond that over the coming year.

ParStream in an IOT Application

While the IOT revolution has yet to take hold in a big way, there is already a significant investment in this area in the engineering industry. One example of where ParStream is deployed in this kind of application is in a project with Siemens (Siemens Research Technology Center). The goal of the engineering research project is to make detailed comparisons between different models of gas turbines in production.

Sensors within the gas turbines report readings which are captured by ParStream and monitored in real time. There are 5,000 sensors in each gas turbine, and they produce about 180 billion records per year for analysis. Because ParStream is being deployed, there is no effective storage size limit for the data, and aside from immediate monitoring of the turbines, data analytics can be performed on large volumes of historical data as well. This has meant the ability to perform real-time, flexible analytics and the ability to compare multiple gas turbines on a granular level, both of which have given Siemens a significant understanding of the performance and efficiency of their assets.
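To put the quoted volume in perspective, the following back-of-the-envelope calculation converts 180 billion records per year into a sustained ingest rate. It assumes a 365-day year and an even spread of records over time, neither of which is stated in the source. The paper also does not say how many turbines contribute to the total, so no per-turbine or per-sensor rate is derived here.

```python
RECORDS_PER_YEAR = 180_000_000_000      # figure quoted in the text
SECONDS_PER_YEAR = 365 * 24 * 60 * 60   # assumption: non-leap year

# Assumption: records arrive at a uniform rate throughout the year.
records_per_second = RECORDS_PER_YEAR / SECONDS_PER_YEAR
records_per_day = RECORDS_PER_YEAR / 365

print(f"{records_per_second:,.0f} records per second")
print(f"{records_per_day:,.0f} records per day")
```

Under these assumptions the average works out to something on the order of 5,700 records per second, sustained around the clock, which illustrates why ingest speed matters as much as query speed for this class of application.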

In Summary

Businesses that are looking to adopt an analytical database capable of handling the most demanding workloads on very large amounts of data would be wise to take a detailed look at ParStream. It delivers outstanding performance and is capable of near real-time analytics. It is particularly suited to the emerging area of IOT applications, where its federated architecture is not only able to process extremely large amounts of data, but does so swiftly, in a widely distributed environment.

About The Bloor Group

The Bloor Group is a consulting, research and technology analysis firm that focuses on open research and the use of modern media to gather knowledge and disseminate it to IT users. Visit both www.TheBloorGroup.com and www.InsideAnalysis.com for more information.

The Bloor Group is the sole copyright holder of this publication.
Austin, TX 78720 | 512-524-3689