
DEGREE PROJECT IN INFORMATION AND COMMUNICATION TECHNOLOGY, SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN 2018

Hive, Spark, Presto for Interactive Queries on Big Data

NIKITA GUREEV

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


TRITA-EECS-EX-2018:468

www.kth.se


Abstract

Traditional relational database systems cannot be used efficiently to analyze data with large volume and different formats, i.e. big data. Apache Hadoop is one of the first open-source tools that provides a distributed data storage system and resource manager. The space of big data processing has been growing fast over the past years and many technologies have been introduced in the big data ecosystem to address the problem of processing large volumes of data, and some of the early tools have become widely adopted, with Apache Hive being one of them. However, with the recent advances in technology, there are other tools better suited for interactive analytics of big data, such as Apache Spark and Presto.

In this thesis these technologies are examined and benchmarked in order to determine their performance for the task of interactive business intelligence queries. The benchmark is representative of interactive business intelligence queries and uses a star-shaped schema. The performance of Hive Tez, Hive LLAP, Spark SQL, and Presto is examined with text, ORC, and Parquet data at different volumes and concurrency levels. A short analysis and conclusions are presented with reasoning about the choice of framework and data format for a system that would run interactive queries on big data.

Keywords: Hadoop, SQL, interactive analysis, Hive, Spark, Spark SQL, Presto, Big Data


Abstract

Traditional relational database systems cannot be used efficiently to analyze large data volumes and different file formats, i.e. big data. Apache Hadoop is one of the first open-source tools that provides a distributed data storage and resource management system. The field of big data processing has grown quickly in recent years and many technologies have been introduced in the big data ecosystem to handle the problem of processing large data volumes, and some early tools have become widely used, with Apache Hive being one of them. With new advances in the field there are now tools better suited for interactive analysis of big data, such as Apache Spark and Presto.

In this thesis these technologies are analyzed with benchmarks in order to determine their performance for the task of interactive business intelligence queries. These benchmarks are representative of interactive business intelligence queries and use a star-shaped schema. The performance of Hive Tez, Hive LLAP, Spark SQL, and Presto is examined with text, ORC, and Parquet data for different volumes and degrees of parallelism. A short analysis and summary are presented with reasoning about the choice of framework and data format for a system that executes interactive queries on big data.

Keywords: Hadoop, SQL, interactive analysis, Hive, Spark, Spark SQL, Presto, Big Data


Contents

1 Introduction
  1.1 Problem
  1.2 Purpose
  1.3 Goals
  1.4 Benefits, Ethics and Sustainability
  1.5 Methods
  1.6 Outline

2 Big Data
  2.1 Hadoop
  2.2 Hadoop Distributed File System
  2.3 YARN

3 SQL-on-Hadoop
  3.1 Hive
  3.2 Presto
  3.3 Spark
  3.4 File Formats

4 Experiments
  4.1 Data
  4.2 Experiment Setup
  4.3 Performance Tuning

5 Results
  5.1 Single User Execution
  5.2 File Format Comparison
  5.3 Concurrent Execution

6 Conclusions
  6.1 Single User Execution
  6.2 File Format Comparison
  6.3 Concurrent Execution
  6.4 Future Work


1 Introduction

The space of big data processing has been growing fast over the past years [1]. Companies are making analytics of big data a priority, meaning that interactive querying of the collected data becomes an important part of decision making. With growing data volume the process of analytics becomes less interactive, as it takes a lot of time to process the data before the business receives insights. Recent advances in big data processing allow interactive queries, as opposed to only long-running data processing jobs, to be performed on big data. Interactive queries are low-latency, sometimes ad hoc queries that analysts can run over the data to gain valuable insights. The most important feature in this case is a fast response from the data processing tool, which shortens the feedback loop and makes data exploration more interactive for the analyst.

Many technologies have been introduced in the big data ecosystem to address the problem of processing large volumes of data, and some of the early tools have become widely adopted [2], with Apache Hive1 being one of them. However, with recent advances in technology, there are other tools better suited for interactive analytics of big data, such as Apache Spark2 and Presto3. In this thesis Hive, Spark, and Presto are examined and benchmarked in order to determine their relative performance for the task of interactive queries.

There are several works taken into account during the writing of this thesis. Similar work was performed by atScale in 2016 [3], which claims to be the first work on the topic of big data analytics. The report is well done, but the main issue is that with the current pace of development the results from several years before can become outdated and less relevant for deciding which data processing framework to use. Another work in a similar vein is SQL Engines for Big Data Analytics [4], but the main focus of that work is the domain of bioinformatics, which lessens its relevance for business intelligence. The work was also done in 2015, making it even older than the atScale report. Performance Comparison of Hive, Impala and Spark SQL [5] from 2015 was also considered, but has its drawbacks. Several other works served as references in choosing the method and setting up the benchmark, including Sparkbench [6], BigBench [7], and Making Sense of Performance in Data Analytics Frameworks [8].

1.1 Problem

How is the performance on interactive business intelligence queries impacted by using Hive, Spark, or Presto with variable data volume, file format, and number of concurrent users?

1 Apache Hive - https://hive.apache.org/
2 Apache Spark - https://spark.apache.org/
3 Presto - https://prestodb.io/


1.2 Purpose

The purpose of this thesis is to assess the possible performance impact of switching from Hive to Spark or Presto for interactive queries. Usage of the latest versions of the frameworks makes the work more relevant, as all three frameworks are undergoing rapid development. Considering the focus on interactive queries, several aspects of the experiments are changed from the previous works, including the choice of benchmark, the experimental environment, and the file format.

1.3 Goals

The main goal of this thesis is to produce an assessment of Hive, Spark, and Presto for interactive queries on big data of different volume, data format, and number of concurrent users. The results are used to motivate a suggested choice of framework for interactive queries, when an existing system is reworked or the creation of a new system is planned.

1.4 Benefits, Ethics and Sustainability

The main beneficial effect of this thesis is an independently conducted, fair comparison of several big data processing frameworks in terms of interactive queries. This will help with the choice of tools when implementing a system for running analytical queries with constraints on responsiveness and speed, on hardware and data corresponding to the setup in this work.

As this thesis uses state-of-the-art versions of the frameworks in question, it includes all of the improvements that were absent from previous similar works, while ensuring that no framework operates under suboptimal conditions and no framework is given special treatment or tuning.

1.5 Methods

An empirical method is used, as analytical methods cannot be efficiently applied to the presented problem within the resource and time constraints [9]. The results are collected by generating data of different volume, implementing an interactive query suite, tuning the performance of the frameworks, and running the query suite on the data. This follows the approach established by the most relevant previous works [3], [4], [5], with changes made in line with the focus of this thesis.


1.6 Outline

In the Big Data section the big data ecosystem is described, with emphasis on Hadoop and YARN. In the SQL-on-Hadoop section the data processing frameworks are presented, first Hive, then Presto, then Spark. The ORC and Parquet file formats are also briefly described. In the Experiments section the benchmark and experimental setup are described. In the Results section all of the experimental results are outlined and briefly described. In the Conclusions section the results are summarized and conclusions are drawn, with future work outlined.


2 Big Data

This thesis project is focused on comparing the performance of several big data frameworks in the domain of interactive business intelligence queries. Initially, work in the big data space focused on long-running jobs, but with the advance of big data processing tools it has become more common for companies to be able to execute interactive queries over aggregated data. In this section the big data ecosystem is described, together with a common Hadoop setup.

2.1 Hadoop

Apache Hadoop is a data processing framework targeted at distributed processing of large volumes of data on one or more clusters of nodes running on commodity hardware. Hadoop is an open-source project under the Apache Foundation, with Java being the main implementation language [10].

The main components of the Hadoop project are:

• Hadoop Common: the common utilities that support the other Hadoop modules

• Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application data

• Hadoop YARN: a framework for job scheduling and cluster resource management

• Hadoop MapReduce: a YARN-based system for parallel processing of large data sets

Hadoop is an efficient solution for big data processing as it enables large-scale data processing workloads relatively cheaply by using commodity hardware clusters [11]. Hadoop provides fast and efficient data processing and fault-tolerance.

2.1.1 MapReduce

MapReduce [12] uses a shared-nothing architecture, meaning that every node is independent and self-sufficient, for processing large datasets with a distributed algorithm on clusters of commodity hardware. Hadoop uses MapReduce as the underlying programming paradigm. MapReduce expresses distributed computations on large volumes of data as a sequence of distributed operations on key-value pair datasets. The Hadoop MapReduce framework utilizes a cluster of nodes and executes MapReduce jobs defined by a user across the machines in the cluster.


Figure 1: A MapReduce computation

Figure 1 shows a MapReduce computation. Each MapReduce computation can be separated into three phases: map, combine, and reduce. In the map phase the input data is split by the framework into a large number of small fragments that are subsequently assigned to map tasks. The framework takes care of distributing the map tasks across the cluster of nodes on which it operates. Each of these map tasks starts to consume key-value pairs from the fragment that was assigned to it and produces intermediate key-value pairs after processing it [12].

The combine, or shuffle and sort, phase is performed next on the intermediate key-value pairs. The main objective of this phase is to prepare all of the tuples for the reduce phase by placing the tuples with the same key together and partitioning them by the number of machines that are used in the reduce phase.

Finally, in the reduce phase each reduce task consumes the fragment of intermediate tuples that is assigned to it. For each of the tuples a user-defined reduce function is invoked and produces an output key-value pair. The framework distributes the workload across the cluster of nodes.

One important aspect of MapReduce is that map and reduce tasks are completely independent of all other concurrent map or reduce tasks, so they can safely run in parallel on different keys and data. Hadoop provides locality awareness, meaning that on a large cluster of machines it tries to match map operations to the machines that store the data the map needs to run on.
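To make the key-value flow concrete, the sketch below shows a minimal word-count job in the style of Hadoop Streaming, where the mapper and reducer are plain scripts reading from standard input. The file names and data are hypothetical: the mapper emits (word, 1) pairs, the shuffle and sort phase groups them by key, and the reducer sums the counts per word.

# mapper.py - emits one (word, 1) pair per word read from standard input
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py - input arrives sorted by key, so counts per word can be summed in one pass
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")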


Figure 2: Hadoop MapReduce architecture

Figure 2 shows the architecture of the Hadoop MapReduce framework. It is implemented using a master/worker architecture. There is a single master server called the Jobtracker and a number of worker servers called Tasktrackers, one per node in the cluster. Users interact with the Jobtracker by sending MapReduce jobs to it, which are subsequently put in the pending jobs queue. The jobs in the queue are executed first-in-first-out (FIFO). The Jobtracker assigns the individual map and reduce tasks to Tasktrackers, which handle the task execution and data motion across MapReduce phases. The data itself is commonly stored on a distributed file system, frequently the Hadoop Distributed File System [13].

2.2 Hadoop Distributed File System

The main purpose of the Hadoop Distributed File System (HDFS) [13] is to reliably store very large files that do not fit into a single hard drive of a node across nodes in a large cluster. The initial inspiration for HDFS was the Google File System [14].

HDFS is highly fault-tolerant and is designed to run on commodity machines. A commodity machine is an already-available computing component for parallel computing, used to get the greatest amount of useful computation at low cost [11]. In this environment the failure of hardware is expected and needs to be handled routinely. One HDFS instance can consist of thousands of server nodes, with each node being responsible for part of the data. HDFS makes failure detection and automatic recovery a core architectural goal.


Some of the projects in the Hadoop ecosystem rely on streaming access to their data sets. The initial goal of HDFS was to provide more batch than interactive access, with emphasis on high throughput rather than low latency of data access. With this in mind, HDFS was designed with some of the POSIX4 semantics traded off for an increase in data throughput and to enable streaming access to file system data.

One of the assumptions HDFS makes is that applications relying on it require a write-once-read-many access model for files. Once a file is created, written to, and closed, it does not need to be changed. This assumption simplifies the resolution of data coherency issues and enables high data throughput.

Data replication is one of the main architectural goals of HDFS. It is designed to reliably store very large files across many nodes in a cluster. Each file is stored as a sequence of blocks, all of the same fixed size, except for the last block, which can be smaller than or equal to the configured block size. The blocks are replicated in order to provide fault tolerance, with the block size and replication factor being configurable by an application.

Figure 3: HDFS architecture. Image adapted from HDFS Design page in Apache wiki6

4 POSIX - http://www.opengroup.org/austin/papers/posix_faq.html
6 HDFS Design page in Apache wiki - https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html


Figure 3 shows the architecture of HDFS. HDFS has a master/worker architecture and a single cluster consists of a single master node called the Namenode and multiple worker nodes called Datanodes, usually one per node in the cluster. The Namenode manages the file system namespace and regulates file access by clients. The Datanodes manage the storage that they are attached to and are responsible for serving read and write requests from the clients. HDFS exposes the file system namespace and allows users to store data in files. A file is split into one or more blocks and the Datanodes are responsible for their storage. The Namenode executes the file system operations and is responsible for opening, closing, and renaming files and directories. The Datanodes execute block creation, deletion, and replication on requests from the Namenode.

HDFS supports a hierarchical file organization. Users can create and remove files, move a file from one directory to another, or rename a file. The Namenode maintains the file system namespace and records any change to the namespace or its properties. The number of replicas of a file that should be maintained by HDFS can be specified by a user and is called the replication factor of that file, which is stored by the Namenode. In addition, the Namenode is responsible for the replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the Datanodes in the cluster. A Datanode sends a list of all blocks that are stored on it in a Blockreport. If a Heartbeat is received then the Datanode is functioning and healthy. The absence of a Heartbeat points to a network partition, a subset of Datanodes losing connectivity to the Namenode. The Namenode then marks all of the Datanodes without recent Heartbeats as dead and stops forwarding any new requests to them. All of the data stored on dead Datanodes is unavailable to HDFS, leading to new replica placement in case the death of Datanodes lets the replication factor of some files go below the specified value.

Replica placement is a critical factor in providing fault tolerance. HDFS aims to place replicas of data on different racks in order to prevent data loss in case a whole rack of servers experiences failure. There is a simple but non-optimal policy of placing each replica on a unique rack, which provides fault tolerance against entire rack failures and the bandwidth of reading from separate racks, at the increased cost of writes that need to be transferred across multiple racks. For the common case of a replication factor of three, HDFS places one replica on the local rack, another on a remote rack, and a final one on a different machine in the same remote rack. The rationale behind this is that node failure is much more common than rack failure, and having two copies on the same remote rack improves write performance without trading off too much reliability and read performance.

HDFS provides data integrity by using checksums. Corruption of the data stored in an HDFS block is possible due to disk storage faults, network faults, or bugs in software. When a client retrieves a file it verifies the content against a checksum stored in a hidden file in the same HDFS namespace as the file.


2.3 YARN

Yet Another Resource Negotiator (YARN) [15] is an Apache Hadoop component that is dedicated to resource management for the Hadoop ecosystem. Initially Hadoop was focused on running MapReduce jobs for web crawls, but later became used more widely for very different applications. The initial design tightly coupled the programming model with the resource management infrastructure and centralized the handling of job control flow, leading to scalability issues in the scheduler. The aim of YARN is to remedy that. The main requirements for YARN are listed as follows.

1. Scalability - an inherent requirement from running on big data

2. Multi-tenancy - as resources are typically requested by multiple concurrent jobs

3. Serviceability - decoupling of upgrade dependencies in order to accommodate slight differences in Hadoop versions

4. Locality Awareness - a key requirement to minimize the overhead of sending data over the network

5. High Cluster Utilization - making the cluster economical and minimizing time spent unused

6. Reliability/Availability - continuous monitoring of jobs to provide fault-tolerance

7. Secure and Auditable Operation - a critical feature for a multi-tenant system

8. Support for Programming Model Diversity - required by the ever-growing Hadoop ecosystem

9. Flexible Resource Model - separation of map and reduce tasks can bottleneck resources and requires careful management of resources

10. Backward Compatibility - to facilitate adoption


Figure 4: YARN System Architecture. Image adapted from YARN Design page in Apache wiki8

Figure 4 shows the architecture of YARN. The two main entities are the global ResourceManager (RM) and an ApplicationMaster (AM), one per application. The ResourceManager monitors how the resources are used and the liveness of nodes, enforces the resource allocation invariants, and acts as an arbiter in resource contention among tenants. The responsibilities of the ApplicationMaster are the coordination of the logical plan for a job by issuing resource requests to the ResourceManager, generating the physical plan from the received resources, and coordinating the plan execution.

The ResourceManager, acting as a central authority, can ensure the fairness and locality requirements across tenants. Based on the issued requests from the applications, the ResourceManager can dynamically allocate leases or containers to applications to be executed on particular nodes. Each node runs a specific daemon called the NodeManager (NM) that helps enforce and monitor these assignments. NodeManagers are also responsible for tracking resource availability, reporting node failures, and managing the lifecycle of containers. From these snapshots of state from the NMs the RM can create a global view of the cluster state.

Each job submitted to the RM goes through an admission control phase, during which the security credentials are validated and administrative checks are performed, due to the secure and auditable operation requirement. If everything is in order, the job state is set to accepted and it is passed to the Scheduler.

8 YARN Design page in Apache wiki - https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html


All of the accepted applications are recorded in persistent storage in order to recover in case of ResourceManager failure. After that the Scheduler acquires the necessary resources and sets the job state to running: a container is allocated for the ApplicationMaster and it is spawned on one of the nodes in the cluster.

The ApplicationMaster manages all of the details of the lifecycle, such as resource consumption elasticity, management of flow execution, failure handling, and local optimizations. This provides scalability and programming model flexibility, as the ApplicationMaster can be implemented in any programming language, as long as it satisfies the few assumptions YARN makes.

Commonly, the ApplicationMaster will require resources from multiple nodes in the cluster. In order to obtain the leases, the AM sends a resource request to the RM with the locality preferences and container properties. The ResourceManager makes an attempt to satisfy the requests from each application according to the specified availability and scheduling policies. Each time a resource request is granted, a lease is generated and pulled with the heartbeat by the AM. A token-based security mechanism guarantees that a request is authentic when it is passed from the AM to the NM. On container start the AM encodes a launch request that is specific to the application. Running containers can directly report status and liveness to the AM and receive framework-specific commands from it, without any dependencies on YARN.


3 SQL-on-Hadoop

With the development of the big data ecosystem the tools for big data processing have become more sophisticated and easy to use. One of the major requirements for analytics tools is the ability to execute ad hoc queries on the data stored in the system. SQL, being one of the main languages used for data querying9, became the standard most frameworks try to integrate, creating a number of SQL-like languages. In this section several big data processing frameworks, their architecture, and their usage are presented.

3.1 Hive

Apache Hive [16] is a data warehouse system that supports read and write operations over datasets in a distributed storage. The queries support SQL-like syntax through a separate query language called HiveQL10. HiveServer2 (HS2) is a service enabling clients to run Hive queries11.

Figure 5: Hive System Architecture. Image adapted from Hive Design page in Apache wiki13

Figure 5 shows the major components of Hive and how it interacts with Hadoop. There is a User Interface (UI) that is used for system interaction, initially being only a command line interface (CLI).

9 https://www.tiobe.com/tiobe-index/
10 Hive Language Manual - https://cwiki.apache.org/confluence/display/Hive/LanguageManual
11 https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Overview
13 Hive Design page in Apache wiki - https://cwiki.apache.org/confluence/display/Hive/Design


The Driver is the component that receives the queries, provides the session handles, and exposes an API modelled on JDBC/ODBC interfaces14. The Compiler parses the query, does semantic analysis of the query blocks and expressions, and is responsible for creating an execution plan with respect to the table and partition metadata, which is stored in the metastore. The Metastore stores all of the structural information about the different tables and partitions in the warehouse, including the information about columns and their types, the serializers and deserializers used to read and write data to HDFS, and the location of the data. The Execution Engine is the component that actually executes the plan created by the compiler. The plan is represented by a Directed Acyclic Graph (DAG) of stages. The component is responsible for dependency management between stages and the execution of stages on the different components. Figure 5 also shows how a query typically flows through the system. A call from the UI to the Driver is issued (step 1). Next, the Driver creates a session handle for the query and sends the query to the Compiler, which generates an execution plan for it (step 2). After that the Compiler gets the required metadata from the Metastore (steps 3, 4). Then the Compiler can generate the plan, while type-checking the query and pruning partitions based on the predicates in the query (step 5). The execution plan consists of stages, each being a map or reduce job, a metadata operation, or an HDFS operation. The execution engine submits the stages to the responsible components (steps 6-6.3). In each map or reduce task the deserializer associated with the table outputs intermediate results, which are written to an HDFS file using a serializer. These files are used for the map and reduce tasks that are executed afterwards, with a fetch results call to the Driver being made for the final query result (steps 7-9).

3.1.1 Hive Data Model

There are three main abstractions in the Hive data model: tables, partitions, and buckets. Hive tables can be viewed as similar to tables in relational databases. Rows in Hive tables are separated into typed columns. Tables support filter, union, and join operations. All the data in a table is stored in an HDFS directory. External tables, which can be created on pre-existing files or directories in HDFS, are also supported by Hive [17]. Each table can have one or more partition keys that determine how the data is stored. The partitions allow the system to prune data based on query predicates, in order not to go through all of the stored data for every query. The data stored in a partition can be divided into buckets, based on the hash of a column. Each bucket is stored as a file in a directory. Buckets allow more efficient evaluation of queries that depend on a sample of the data.
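As a concrete illustration of partitions and buckets, the sketch below creates a Hive table partitioned by date and bucketed by a hash of the customer column, stored as ORC. The table, column, and host names are hypothetical, and the PyHive client together with a reachable HiveServer2 instance is assumed.

from pyhive import hive  # assumes the PyHive package and a running HiveServer2

conn = hive.connect(host="hive-server.example.com", port=10000)
cursor = conn.cursor()
# Hypothetical star-schema fact table: one partition per day, 32 buckets on customer_id
cursor.execute("""
    CREATE TABLE IF NOT EXISTS sales (
        item_id BIGINT,
        customer_id BIGINT,
        amount DOUBLE
    )
    PARTITIONED BY (sale_date STRING)
    CLUSTERED BY (customer_id) INTO 32 BUCKETS
    STORED AS ORC
""")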

In addition to supporting primitive types, Hive has support for arrays and maps. Users can also create user-defined types composed of primitives and collections.

14 https://docs.oracle.com/javase/8/docs/technotes/guides/jdbc/


The Serializer and Deserializer (SerDe) and object inspector interfaces provide the necessary hooks to serialize and deserialize data; User-Defined Functions (UDFs) can have their own object inspectors.

3.1.2 Tez

The initial design of Hive relied on MapReduce for execution, which was later replaced by Apache Tez [18]. The foundation for building Tez was laid when resource management was broken out of Hadoop into YARN [15], which enabled an architectural shift: MapReduce could be viewed as just one of the applications to be run on top of YARN. Tez made the next step by creating a flexible and extensible foundation with support for arbitrary data-flow oriented frameworks.

Apache Tez provides an API that allows clear modelling of logical and physical data flow graphs. This allows users to model computation as a DAG with finer decomposition compared to classic vertex-edge models. In addition, Application Programming Interfaces (APIs) to modify the DAG definition on the fly are available, enabling more sophisticated optimizations in query execution. Tez also supports an efficient and scalable implementation of YARN features - locality awareness, security, reuse of resources, and fault-tolerance.

Tez solves the problem of orchestrating and executing a distributed data application on Hadoop, providing cluster resource negotiation, fault tolerance, elasticity, security, and performance optimization. Tez is composed of several key APIs that define the data processing and orchestration framework the applications need to implement to provide an execution context. This allows Tez to be application agnostic. Computation in Tez is modelled as an acyclic graph, which is natural due to the process of data flowing from sources to sinks with transformations happening on the in-between vertices. A single vertex in the Tez DAG API is a representation of some data processing and is a single step in the transformation. A user provides a processor that defines the underlying logic to handle data. It is quite common for multiple vertices to be executed in parallel across many concurrent tasks. An edge of the DAG is a physical and logical representation of data movement from the vertex that produces data to the one that consumes it. Tez supports one-to-one, broadcast, and scatter-gather data movement between producers and consumers.

After the DAG API defines the structure of the data pipeline, the Runtime API is used to inject application logic. While a DAG vertex represents a data processing step, the actual transformation is applied by executing tasks on the machines in the cluster. Each of the tasks is defined as a composition of a set of inputs, a processor, and outputs (IPO). The processor is defined for each task by the vertex, the output classes of the incoming edges define the inputs, and the input classes of the outgoing edges define the outputs of the vertex. Some of the complexity, such as the underlying data transport, partitioning, and aggregation of shards, is hidden by this representation and configured after the creation of the IPO objects. IPO configuration is done via binary payloads. The processor is presented with inputs and outputs and communicates with the framework by event exchange through a context object. Fault-tolerance is achieved by re-executing tasks to regenerate data on reception of an ErrorEvent.

Tez does not specify any format of data and is not a part of the data plane during the execution of the DAG. The data transfer is performed by the inputs and outputs, with Tez serving as a router for producers and consumers. This gives Tez minimal overhead and makes it data format agnostic, allowing the inputs, processors, and outputs to choose their data formats themselves.

To enable dynamic reconfiguration of the DAG and to adapt the execution on the fly, Tez uses a VertexManager. Each vertex in the DAG is associated with a VertexManager responsible for the reconfiguration of vertices during runtime. A number of different state machines are used to control the lifecycle of vertices and tasks, which interact with the VertexManager through states. The VertexManager is notified of the state transitions through a context object, facilitating decisions on changes to the DAG configuration.

Apache Tez provides automatic partition cardinality estimation as a runtime optimization. For instance, that is used to solve a common problem in MapReduce - determining the required number of tasks for the reduce phase, which depends on the volume of data shuffled from the mappers. Tez can produce a statistical estimate of the total data size and estimate the total required number of reducers at runtime. In addition, by having fine-grained control over DAG vertices, Tez provides scheduling optimizations: locality-aware scheduling, minimizing out-of-order scheduling, reducing the number of scheduling deadlocks, and specific deadlock detection and preemption to take care of these situations.

Multi-tenancy is a common requirement in data processing and the discrete task-based processing model of Apache Tez provides very good support for it. Resource allocation and deallocation with a task as the single unit enables high utilization of cluster resources, with computational power being transferred from applications that no longer require it to those that do. In Tez each task is executed in a container process, which guarantees this resource scalability and elasticity.

Hive 0.13 was the first version to take advantage of Tez, utilizing the increased efficiency of translating SQL queries into Tez DAGs instead of using MapReduce.


Figure 6: MapReduce and Tez for Hive. Images adapted from Tez page in Apache wiki16

Before switching to Tez, Hive used MapReduce for query execution. That led to potential inefficiencies, due to several factors. One instance is queries with more than one reduce sink that cannot be combined due to the absence of correlation in their partition keys. In MapReduce this leads to separate MapReduce jobs being executed for a single query, as shown in Figure 6. Each of the MapReduce jobs reads from and writes to HDFS, in addition to shuffling data. In Tez, on the other hand, the query plan is pipelined and linked directly, referred to as a map-reduce-reduce pattern.
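For reference, the execution engine Hive uses for a session is selected through a configuration property. The sketch below is a hypothetical example using the PyHive client against HiveServer2; the hive.execution.engine property accepts mr and tez (and spark in some distributions), and the table name is assumed.

from pyhive import hive  # assumes the PyHive package and a running HiveServer2

conn = hive.connect(host="hive-server.example.com", port=10000)
cursor = conn.cursor()
# Run this session on Tez instead of classic MapReduce
cursor.execute("SET hive.execution.engine=tez")
cursor.execute("SELECT sale_date, SUM(amount) FROM sales GROUP BY sale_date")
print(cursor.fetchall())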

3.1.3 LLAP

In Hive 2.0 the Live Long And Process (LLAP) functionality was added17. While the improvements introduced in [17] and [18] were substantial, a new initiative was created to introduce asynchronous spindle-aware I/O, pre-fetching and caching of column chunks, and multithreaded JIT-friendly operator pipelines. LLAP provides a hybrid execution model that consists of a long-lived daemon, which replaces direct communication with the HDFS Datanode, and a tightly integrated framework based on DAGs.

LLAP daemons include the caching, pre-fetching, and some query processing functionality. Small and short queries are mostly passed directly to and processed by the daemon, while heavy queries remain the responsibility of standard YARN containers.

In addition, similar to how the Datanode is designed, LLAP daemons are accessible to other applications, which can be useful when a relational view of the data is preferable to a file-centric one.

16 Tez page in Apache wiki - http://tez.apache.org/
17 LLAP page in Apache wiki - https://cwiki.apache.org/confluence/display/Hive/LLAP


The API offers an InputFormat that can be used by other data processing frameworks.

Figure 7: LLAP. Image adapted from LLAP page in Apache wiki18

Figure 7 shows how a job is processed with LLAP and a Tez AM, which coordinates the whole execution. Initially, the input query is passed into LLAP. After that the costly operations, such as the shuffles, are performed in separate containers during the reduce stage. LLAP can be accessed by multiple concurrent queries or applications at the same time.

To achieve the goals of providing JIT optimization and caching while reducing startup costs, the daemon is executed on worker nodes in the cluster and handles the I/O operations, caching, and execution of query fragments. Any request to an LLAP node contains the location of the data and the associated metadata. While the concerns about data locality are left to YARN, the processing of local and remote data locations is handled by LLAP. Fault-tolerance overall is simplified, as any data node can be used to execute any query fragment, which is done by the Tez AM. Similar to Tez, direct communication between Hive nodes is permitted.

The daemon always tries to use multiple threads for I/O and for reading from a compressed format. As soon as data becomes ready it is passed to execution, so that the previous batch can be processed concurrently with the preparation of the next one. LLAP uses an RLE-encoded columnar format for caching, minimizing copy operations in I/O, execution, and cache, and also for vectorized processing. The daemon caches the data itself as well as indexes and metadata for input files, sometimes even for data that is not currently in the cache. The eviction policy is pluggable, but the default is an LRFU policy.

18 LLAP page in Apache wiki - https://cwiki.apache.org/confluence/display/Hive/LLAP


A column chunk is the unit of data in the cache, a compromise between storage efficiency and low overhead in data processing. Dynamic runtime filtering is achieved using a Bloom filter [19].

In order to preserve the scalability and flexibility of Hive, LLAP works within the existing process-based Hive execution. The daemons are optional and Hive can bypass them, even when they are deployed and running. Unlike MapReduce or Tez, LLAP is not an execution engine by itself. The existing Hive execution engines are used to schedule and monitor the overall execution. LLAP results can be a partial result of a Hive query or can be passed to an external Hive task. Resource management is still a responsibility of YARN, with the YARN container delegation model being used by LLAP. Data caching happens off-heap in order to overcome the limitations of JVM memory settings - the daemon can initially use only a small amount of CPU and memory, but additional resources can later be allocated based on the workload.

For partial execution LLAP can take fragments of a query, e.g. a projection or a filter, and work with them. For security and stability only Hive code and some UDFs are accepted by LLAP, and the code is localized and executed on the fly. Concurrent execution of several query fragments from different queries and even sessions is allowed. Users can gain direct access to LLAP nodes using a client API.

3.2 Presto

Presto [20] is a distributed massively parallel query execution engine developed by Facebook in 201319. Presto can process data from multiple sources such as HDFS, Hive, and Cassandra. There are many other available connectors, making the integration of new data sources easy.

Presto is designed to query data where it is stored, rather than moving it into a single storage system. Presto is able to combine multiple data sources in a single query, which makes data integration easier. However, Presto cannot act as a general-purpose relational database, as it is not designed to handle Online Transaction Processing (OLTP).

19 https://www.facebook.com/notes/facebook-engineering/presto-interacting-with-petabytes-of-data-at-facebook/10151786197628920/


Figure 8: Presto System Architecture

Figure 8 shows the architecture of Presto. There are two types of nodes in a Presto cluster: Coordinators and Workers. The main responsibilities of the Presto coordinator are parsing the input queries, planning query execution, and managing the worker nodes. A Presto worker is responsible for executing tasks and processing data fetched from connectors and exchanged between workers.


Figure 9: Presto Task Execution

Presto does not use MapReduce, but instead uses message passing to execute queries, removing the overhead of transitions between map and reduce phases that is present in Hive, as shown in Figure 9. That leads to improvements in performance, as all of the stages are pipelined without any overhead from disk interaction. However, that reduces fault-tolerance, as lost data needs to be recomputed. Another limitation is that data chunks need to fit in memory.

Presto is first and foremost a distributed system running on a cluster of nodes. The Presto query engine is optimized for interactive analysis of large volumes of data and supports standard ANSI SQL, including complex queries, aggregations, joins, and window functions. Presto supports most of the standard data types commonly used in SQL.
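To make this concrete, the sketch below issues an interactive query combining a join, an aggregation, and a window function through the PyHive Presto client. Host, catalog, schema, and table names are hypothetical; a running Presto coordinator with a configured Hive connector is assumed.

from pyhive import presto  # assumes the PyHive package and a running Presto coordinator

conn = presto.connect(host="presto-coordinator.example.com", port=8080,
                      catalog="hive", schema="default")
cursor = conn.cursor()
# ANSI SQL: join, aggregation, and a window function in a single interactive query
cursor.execute("""
    SELECT c.region,
           SUM(s.amount) AS total,
           RANK() OVER (ORDER BY SUM(s.amount) DESC) AS region_rank
    FROM sales s
    JOIN customers c ON s.customer_id = c.customer_id
    GROUP BY c.region
""")
for row in cursor.fetchall():
    print(row)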


When an input query is sent from the client node it is first received by the coordinator, which parses it and creates a query execution plan. Then the scheduler assigns the worker nodes that are closest to the data in order to minimize the network overhead of data transfer. The workers perform data extraction through connectors when necessary, execute the created distributed query plan, and return results to the coordinator, which returns the final result of the query to the client.

Presto is not specific to Hadoop and can work with multiple data sources in the same query, making Presto well suited for cloud environments that do not use HDFS as storage. There are a variety of connectors to data sources, including HDFS, Amazon S3, MySQL, Apache Hive, Apache Kafka, Apache Cassandra, PostgreSQL, Redis, and more20.

3.3 Spark

Apache Spark is a unified engine for distributed data processing that was created in 2009 at the University of California, Berkeley [21]. Its first release was in 2010 and since then Apache Spark has become one of the most active open source projects in big data processing. It is part of the Apache Foundation and has had over 1000 contributors [22]. Spark offers several frameworks (Figure 10), including Spark SQL for analytics, Spark Streaming, MLlib for machine learning, and GraphX for graph-specific processing.

Figure 10: Spark ecosystem. Image adapted from Apache Spark page21

20 Presto Connectors - https://prestodb.io/docs/current/connector.html
21 Apache Spark - https://spark.apache.org/


3.3.1 Architecture

Each Spark program has a driver that runs the execution of the different concurrent operations on the nodes in the cluster. The main abstraction is the resilient distributed dataset (RDD), which represents a collection of objects partitioned over the Spark nodes in the cluster. Most commonly it is created by reading in a file from HDFS. Another important abstraction in Spark is the concept of shared variables. By default, a copy of each variable used in a function is sent to each task during concurrent execution. However, in some cases variables need to be shared across tasks, or with the driver. For these cases Spark supports two types of shared variables: broadcast variables, which are used to cache a value in memory on all nodes, and accumulators, which are mutated only by adding to them, for instance sums and counts.
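The following minimal PySpark sketch, with hypothetical data, shows both kinds of shared variables: a small lookup table broadcast to the executors and an accumulator counting records the tasks could not resolve.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shared-variables-sketch").getOrCreate()
sc = spark.sparkContext

# Broadcast variable: a small lookup table cached on every executor instead of shipped with each task
lookup = sc.broadcast({"SE": "Sweden", "US": "United States"})

# Accumulator: tasks can only add to it; the driver reads the total
unknown = sc.accumulator(0)

def to_country(code):
    if code not in lookup.value:
        unknown.add(1)
        return "unknown"
    return lookup.value[code]

countries = sc.parallelize(["SE", "US", "SE", "DE"]).map(to_country).collect()
print(countries, "unknown codes:", unknown.value)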

Figure 11: Spark architecture. Image adapted from Cluster Overview page in Apache wiki23

Figure 11 shows the architecture of Apache Spark. Spark utilizes a master/worker architecture with a single Driver node and multiple Worker nodes. Each Worker node contains an Executor that receives and executes the application code from the SparkContext running on the Driver. Each application has its own executor processes, independent from other applications, which remain alive for the whole lifecycle of the application. The executor processes can run tasks in multiple threads. This provides isolation of applications, at the cost of them being unable to directly share data without using an external storage system.

Spark supports four different cluster managers: the Spark standalone manager, Apache Mesos, Hadoop YARN, and Kubernetes.

23 Cluster Overview page in Apache wiki - https://spark.apache.org/docs/latest/cluster-overview.html


Spark is agnostic to the underlying cluster manager. The only requirement is that Spark is able to spawn executor processes that can communicate with each other. Spark can even run on a cluster manager that also supports other applications, as in the case of Mesos or YARN.

3.3.2 RDD

The key programming abstraction in Spark is the Resilient Distributed Dataset (RDD), a fault-tolerant collection of objects partitioned over the Spark nodes in the cluster that can be processed concurrently. Spark offers APIs in Java, Scala, Python, and R, through which users can pass functions to be executed on the Spark cluster. Commonly, RDDs are first read from an external source, such as HDFS, and then transformed using operations such as map, filter, or groupBy. RDDs are lazily evaluated, so an efficient execution plan can be created for the transformations specified by the user.

Some operations performed in Spark trigger a shuffle event. The shuffle is the mechanism that Spark uses to re-distribute data across partitions so it will be grouped differently. Commonly this includes copying the data across nodes and executors, which makes shuffling costly and complex. In Spark the distribution of data across nodes is typically not done in preparation for any specific operation. A single task operates on a single partition, so to organize all the data for a single reduceByKey, for example, an all-to-all operation is required. This leads to Spark reading all of the values for each key across all of the partitions and bringing the values together in order to compute the result for each key. The performance impact of the shuffle operation is significant. A shuffle involves expensive I/O operations on disk, CPU load for data serialization, and I/O over the network to transmit the data to other Spark nodes. The shuffle operation consists of map and reduce phases, not to be confused with the map and reduce Spark operations. Spark keeps the results of individual map tasks in memory until they do not fit, then spills them to disk, sorted by target partition and stored in a single file. Reduce tasks read the blocks relevant to them. In addition to simple read/write interaction with the disk, the shuffle operation generates a lot of intermediate files. Depending on the Spark version, these files can be preserved until the corresponding RDDs are no longer used and only then garbage collected. This can lead to long-running Spark jobs consuming a lot of space on disk.
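The sketch below illustrates these ideas with the RDD API: the transformations are lazy, and reduceByKey is the all-to-all step that triggers a shuffle. The HDFS path and data are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-shuffle-sketch").getOrCreate()
sc = spark.sparkContext

# Reading the file and the transformations below are lazy; nothing runs yet
lines = sc.textFile("hdfs:///data/events.txt")  # hypothetical HDFS path
pairs = (lines.flatMap(lambda line: line.split())
              .map(lambda word: (word, 1)))

# reduceByKey re-distributes the data so all values for a key land in one partition
counts = pairs.reduceByKey(lambda a, b: a + b)

# take() is an action, so only here is the whole lineage actually executed
print(counts.take(10))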

In order to provide fault-tolerance Spark uses a lineage-based approach. In contrast with the common strategies of replication or checkpointing, Spark tracks the graph of transformations leading to each partition of the RDD and reruns the necessary operations on the base data if an RDD partition is lost. For shuffle operations the data is persisted locally on the sender node in case of receiver failure. This approach can significantly increase the efficiency of data-intensive workloads.


Spark also provides options for explicit data sharing across computations. RDDs are not persisted by default and are recomputed, but there is a setting to persist them in memory for rapid reuse. The data can spill to disk if it does not fit in the node's memory, and the option to persist data to disk is also available. Data sharing can provide an increase in speed of up to several orders of magnitude for interactive queries and iterative algorithms that reuse data. The persist() and cache() methods of an RDD are used for that. Different storage levels can be specified for these functions, for instance allowing an RDD to be stored as serialized or deserialized Java objects in memory, either with or without spilling to disk, or stored directly on disk, with additional options that allow partitions to be replicated on another Spark node in the cluster. The storage levels provide the developer with a trade-off between memory and CPU usage. There are several factors that should be taken into consideration when deciding which storage level to use. First, if the data fits in memory it is best to leave it stored there, as this is the most CPU-efficient option and allows operations on the RDD to execute the fastest. Second, if the data does not fit into memory in deserialized form, storing it in memory in serialized form should be the second choice. It is more efficient in terms of space, but introduces serialization and deserialization overhead, while still being reasonably fast. Third, if both previous levels are impossible, using the disk should be a last resort: the data should be persisted on disk only if computing a partition is so expensive that reading it from disk would be faster, and in many cases recomputing a partition is less costly. Finally, the replication option should be used only for faster fault-tolerance. By default, Spark already provides fault-tolerance for all partitions on all storage levels [23].
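As an illustration of explicit persistence, the sketch below caches a parsed RDD with a storage level that keeps partitions in memory and spills to disk when they do not fit. The path and data are hypothetical.

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("persist-sketch").getOrCreate()
sc = spark.sparkContext

events = sc.textFile("hdfs:///data/events.txt")  # hypothetical path
parsed = events.map(lambda line: line.split(","))

# Keep the parsed partitions in memory, spilling to disk if they do not fit
parsed.persist(StorageLevel.MEMORY_AND_DISK)

# Both actions reuse the persisted partitions instead of re-reading and re-parsing the file
print(parsed.count())
print(parsed.first())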

3.3.3 Dataframe, Dataset, and Spark SQL

Unlike the basic RDD API, the interface provided by Spark SQL contains significantly more information about how the data is structured and how the computation is performed. Dataframes are distributed collections of data added to Spark in order to execute SQL-like queries on top of the Spark engine. Datasets are an improvement over Dataframes that combine the benefits of RDDs, such as strong typing and powerful lambda functions, with the extra optimizations that Spark is able to do for Dataframes using the Catalyst optimizer. Datasets can be constructed from Java objects and manipulated using functional transformations. This API is available only in the Java and Scala programming languages.
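A short PySpark sketch of the Dataframe API is shown below; because the data carries a schema, Catalyst can optimize both the method-chain form and the equivalent SQL form of the query. The path and column names are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframe-sketch").getOrCreate()

# A DataFrame carries a schema, so Catalyst can optimize the whole query plan
sales = spark.read.parquet("hdfs:///warehouse/sales")  # hypothetical path

by_date = (sales.filter(sales.amount > 0)
                .groupBy("sale_date")
                .sum("amount"))

# The same logical plan can be expressed in SQL over a temporary view
sales.createOrReplaceTempView("sales")
by_date_sql = spark.sql("SELECT sale_date, SUM(amount) AS total FROM sales GROUP BY sale_date")

by_date.show()
by_date_sql.show()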


Figure 12: Spark SQL and Hive integration. Image adapted from Apache Spark SQL25

Spark SQL can read data from an existing Hive installation by communicating with the Metastore (figure 12). Spark does not include all of the dependencies that Hive needs, but if they are detected on the classpath, Spark loads them automatically. The same applies to all of the worker nodes with respect to Hive serializers and deserializers (SerDe). Spark also supports starting up a Thrift server26; however, direct loading of files from HDFS was used in this work.
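A minimal sketch of the two ways of exposing data to Spark SQL (the HDFS path is hypothetical): enabling Hive support attaches the session to an existing Metastore, while the alternative reads the files directly from HDFS and registers them as temporary views.

import org.apache.spark.sql.SparkSession

object SparkSqlSetup {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport makes tables from the Hive Metastore visible to Spark SQL.
    val spark = SparkSession.builder()
      .appName("spark-sql-setup")
      .enableHiveSupport()
      .getOrCreate()

    // Alternative: load the pipe-delimited files straight from HDFS
    // and register them as a temporary view.
    val lineorder = spark.read
      .option("sep", "|")
      .option("inferSchema", "true")
      .csv("hdfs:///ssb/sf10/lineorder/")
    lineorder.createOrReplaceTempView("lineorder")

    spark.sql("select count(*) from lineorder").show()
    spark.stop()
  }
}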

3.4 File Formats

Storage of big data comes with additional challenges. The trade-off between volume and efficiency of access becomes more important, as every improvement in compression rate may mean major differences in the total requirement for storage space. The efficiency of access, on the other hand, is necessary for running analytical queries on data in an interactive manner. This subsection presents two file formats that are commonly used for storing big data.

3.4.1 ORC File Format

The Optimized Row Columnar (ORC) file format is an efficient way to store data for Hive. It improves Hive performance by producing a single output file for each task, reducing the load on the NameNode, introducing a lightweight index that is stored inside the file, and allowing concurrent reading of the same file by several readers. In addition, it enables block-mode compression depending on the type of data, with run-length encoding for integer types and dictionary encoding for string types. The metadata is stored using Protocol Buffers27, allowing field addition and removal.

25 Apache Spark SQL - https://spark.apache.org/sql/
26 https://hortonworks.com/tutorial/spark-sql-thrift-server-example/
27 https://github.com/google/protobuf/wiki

Figure 13: ORC File Format. Image adapted from ORC Language Manual page in Apache wiki29

Each ORC file consists of stripes, which represent grouped row data. Additional information on the content is stored in the file footer, such as the list of stripes in the file, the number of rows in a stripe, and column-level aggregates on count, min, max, and sum. Compression parameters and the size after compression are stored in the postscript. The stripe footer contains the directory of stream locations.

29 ORC Language Manual page in Apache wiki - https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC

3.4.2 Parquet File Format

Apache Parquet is another columnar file format popular in the Hadoop ecosystem. The main motivation for this format is to have an efficient columnar data representation with support for complex nested data structures.

Figure 14: Parquet File Format. Image adapted from Parquet documentation page in Apache wiki31

<Column 1 Chunk 1 + Column Metadata>
<Column 2 Chunk 1 + Column Metadata>
...
<Column N Chunk 1 + Column Metadata>
<Column 1 Chunk 2 + Column Metadata>
<Column 2 Chunk 2 + Column Metadata>
...
<Column N Chunk 2 + Column Metadata>
...
<Column 1 Chunk M + Column Metadata>
<Column 2 Chunk M + Column Metadata>
...
<Column N Chunk M + Column Metadata>
File Metadata
4-byte length in bytes of file metadata
4-byte magic number "PAR1"

31 Parquet documentation page in Apache wiki - https://parquet.apache.org/documentation/latest/

In the above example and in figure 14, there are N columns in this table, split into M row groups. The file metadata contains the start locations of all the column metadata. Metadata is written after the data to allow for single-pass writing.

Parquet supports efficient compression and encoding schemes on a per-column level and is open for extension. Apache Parquet supports dictionary encoding, bit packing, and run-length encoding to compress the data. Snappy32 and ZLIB33 are supported codecs. Many data processing frameworks support this file format, including Apache Hive, Apache Drill, Apache Impala, Apache Crunch, Apache Pig, Cassandra, Apache Spark, and Presto34.
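As a minimal sketch (the paths are hypothetical, and this is not necessarily the conversion procedure used in the thesis), the same text data can be rewritten into both columnar formats with Snappy compression from Spark:

import org.apache.spark.sql.SparkSession

object FileFormatWrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("format-write").getOrCreate()

    // Pipe-delimited SSB text files as the source.
    val lineorder = spark.read.option("sep", "|").csv("hdfs:///ssb/sf10/text/lineorder/")

    // Columnar copies of the same data, compressed with Snappy.
    lineorder.write.option("compression", "snappy").parquet("hdfs:///ssb/sf10/parquet/lineorder/")
    lineorder.write.option("compression", "snappy").orc("hdfs:///ssb/sf10/orc/lineorder/")

    spark.stop()
  }
}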

32 https://github.com/google/snappy
33 https://zlib.net/
34 https://cwiki.apache.org/confluence/display/Hive/Parquet

4 Experiments

The main focus of the thesis is business intelligence queries on big data. With this in mind, several experiments are planned. First, the benchmark is run on data of increasing volume, to gain an understanding of how the frameworks' performance changes with a growing volume of data. Second, several data formats are examined in order to have a better understanding of how each framework deals with each format, and which framework works better with which format. Finally, as the planned system works in a concurrent setting, the performance of the frameworks with different numbers of concurrent users is examined.

4.1 Data

The benchmark used to assess the big data frameworks in this thesis is the Star Schema Benchmark [24]. The choice of the benchmark was motivated by the fact that it closely models the typical schemas used in OLAP workloads. The benchmark is based on TPC-H35, but makes a number of modifications to it. It uses a star schema that is much more frequently used for interactive querying. This leads to a smaller number of joins and makes it more useful for testing interactive business intelligence queries.

35 http://www.tpc.org/tpch/

Figure 15: Star Schema Benchmark data model

Figure 15 shows the schema used in the benchmark. The data for this schema was generated for different volumes, defined by a scale factor (SF). The tables below show the data volume for scale factors 1, 10, 20, and 75 in the different file formats.

            SF1         SF10        SF20        SF75
Customer    2.7 Mb      27.3 Mb     54.8 Mb     206.8 Mb
Lineorder   571.3 Mb    5.8 Gb      11.6 Gb     44.4 Gb
Dates       224.6 Kb    224.6 Kb    224.6 Kb    224.6 Kb
Parts       16.3 Mb     65.7 Mb     82.2 Mb     115.4 Mb
Supplier    162.8 Kb    1.6 Mb      3.2 Mb      12.2 Mb
Total       590.8 Mb    5.9 Gb      11.8 Gb     44.7 Gb

Table 1: Data size, by table and scale factor, Text format

            SF1         SF10        SF20        SF75
Customer    722.6 Kb    7.0 Mb      14.0 Mb     52.7 Mb
Lineorder   118.1 Mb    1.2 Gb      2.4 Gb      9.3 Gb
Dates       10.8 Kb     10.8 Kb     10.8 Kb     10.8 Kb
Parts       1.9 Mb      7.8 Mb      9.7 Mb      13.6 Mb
Supplier    48.7 Kb     473.1 Kb    946.1 Kb    3.5 Mb
Total       120.9 Mb    1.2 Gb      2.5 Gb      9.3 Gb

Table 2: Data size, by table and scale factor, ORC format

            SF1         SF10        SF20        SF75
Customer    1.2 Mb      12 Mb       24 Mb       89 Mb
Lineorder   172 Mb      1.7 Gb      3.4 Gb      14 Gb
Dates       40 Kb       40 Kb       40 Kb       40 Kb
Parts       2.5 Mb      9.3 Mb      11 Mb       16 Mb
Supplier    92 Kb       812 Kb      1.6 Mb      5.9 Mb
Total       176 Mb      1.7 Gb      3.5 Gb      14 Gb

Table 3: Data size, by table and scale factor, Parquet format

4.1.1 Queries

Query 1 The query is meant to quantify the amount of revenue increase that would have resulted from eliminating certain companywide discounts in a given percentage range for products shipped in a given year. In each set of queries the specific queries are denoted as Q1.1, Q1.2, and so on.

select sum(lo_extendedprice*lo_discount) as revenue
from lineorder, date
where lo_orderdate = d_datekey
  and d_year = [YEAR]
  and lo_discount between [DISCOUNT] - 1 and [DISCOUNT] + 1
  and lo_quantity < [QUANTITY];

There are three queries generated using this template with specific values for YEAR, DISCOUNT, and QUANTITY. The effect of the filtering is given for scale factor 1. Q1.1 sets YEAR to 1993, DISCOUNT to 2, and QUANTITY to 25, which leaves approximately 11600 rows after filtering. For Q1.2 a year and month are specified instead of a year only, namely January 1994. DISCOUNT is set to be between 4 and 6, and QUANTITY between 26 and 35, leaving approximately 4000 rows. For Q1.3 the filtering is even stricter, with only the 6th week of the year 1994 specified for the time filter, DISCOUNT set to between 5 and 7, and QUANTITY 26 to 35, the same as before. That leaves approximately 500 rows.
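As an illustration (not the thesis' benchmark driver, and assuming the lineorder and date tables are already registered), a concrete query instance can be produced by substituting the Q1.1 parameters into the template and submitting it through Spark SQL:

import org.apache.spark.sql.SparkSession

object Q1Runner {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

    // Q1.1 parameters described above; the other variants only change these values.
    val (year, discount, quantity) = (1993, 2, 25)

    val q11 =
      s"""select sum(lo_extendedprice * lo_discount) as revenue
         |from lineorder, date
         |where lo_orderdate = d_datekey
         |  and d_year = $year
         |  and lo_discount between ${discount - 1} and ${discount + 1}
         |  and lo_quantity < $quantity""".stripMargin

    spark.sql(q11).show()
  }
}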

Query 2 For a second set of queries, the restrictions are placed on two dimensions. The query compares revenue for select product classes, for suppliers in a select region, grouped by more restrictive product classes and all years of orders.

select sum(lo_revenue), d_year, p_brand1
from lineorder, date, part, supplier
where lo_orderdate = d_datekey
  and lo_partkey = p_partkey
  and lo_suppkey = s_suppkey
  and p_category = [CATEGORY]
  and s_region = [REGION]
group by d_year, p_brand1
order by d_year, p_brand1;

There are three queries generated using this template with specific values for CATEGORY and REGION. The estimated numbers of rows after filtering are given for scale factor 1. Q2.1 uses MFGR#12 as CATEGORY, which accounts for approximately 4% of the orders, and sets REGION to AMERICA. That selects around 48000 rows. Q2.2 switches from a single category to a range of brands and sets them to be between MFGR#2221 and MFGR#2228, while the REGION is changed to ASIA. The added restrictions result in around 10000 rows being selected. Q2.3 limits the search to a single brand, MFGR#2339, and changes the REGION to EUROPE. That leads to around 1200 rows being displayed as a result. Each of the selections is disjoint and, in addition, separate from Q1, meaning there is no overlap and no effect of caching in these scenarios.

Query 3 In the third query suite, we want to place restrictions on three dimensions, including the remaining dimension, customer. The query is intended to provide revenue volume for lineorder transactions by customer nation and supplier nation and year within a given region, in a certain time period.

select c_nation, s_nation, d_year, sum(lo_revenue) as revenue
from customer, lineorder, supplier, date
where lo_custkey = c_custkey
  and lo_suppkey = s_suppkey
  and lo_orderdate = d_datekey
  and c_region = [REGION] and s_region = [REGION]
  and d_year >= [YEAR] and d_year <= [YEAR]
group by c_nation, s_nation, d_year
order by d_year asc, revenue desc;

There are four queries generated using this template with specific values for YEAR and REGION. Q3.1 uses ASIA as REGION, leading to 20% of the rows being selected, and the years 1992 to 1997, bringing the total number of rows selected to around 205000. Q3.2 specifies a single nation instead of a region and sets it to UNITED STATES, selecting 4%; the years are not changed. The number of rows returned is approximately 8000. Q3.3 sets tighter restrictions on the location, setting it to two cities in the UK, and adds grouping by them to the query. The year filtering is not changed. This nets around 320 rows. Finally, in Q3.4 the timespan is set to a single month and the location is unchanged, returning only around 5 rows.

Query 4 The last query suite represents a "What-If" sequence of the OLAP type. It starts with a group by on two dimensions and rather weak constraints on three dimensions, and measures the aggregate profit, computed as (lo_revenue - lo_supplycost).

select d_year, c_nation, sum(lo_revenue - lo_supplycost) as profit
from date, customer, supplier, part, lineorder
where lo_custkey = c_custkey
  and lo_suppkey = s_suppkey
  and lo_partkey = p_partkey
  and lo_orderdate = d_datekey
  and c_region = [REGION]
  and s_region = [REGION]
  and p_mfgr = [CATEGORY]
group by d_year, c_nation
order by d_year, c_nation;

There are three queries generated using this template with specific values for REGION. Q4.1 specifies AMERICA as the region and two possible categories, leading to around 96000 orders being returned. Q4.2 additionally restricts the time to 1997 or 1998, with all other filtering conditions remaining the same. That leads to around 27000 rows being selected. Q4.3 adds a restriction on nation and additionally restricts the category to a single one, returning only approximately 500 rows. The complete table of filtering factors for each query can be found in [24].

4.2 Experiment Setup

The experiments were run on a cluster of 9 nodes, each having 32 Gb RAM, 12 cores, and 100 Gb SSD disks. The cluster was set up in AWS EMR 5.13.036, which runs Apache Hive 2.3.2, Apache Spark 2.3, and Presto 0.194. The ORC file format version used is 1.4.3. The Parquet file format version is 1.8.2. The cluster runs Hadoop 2.8.337, with YARN acting as the resource manager and HDFS as the distributed file system. As the main focus of the thesis is interactive queries, the containers are assumed to be long-lived. The time needed to spin up containers is not included in the benchmark results, and only the actual query execution time is tracked. For each query a timestamp is recorded right before it is submitted, and right after the results are collected the second timestamp is recorded. For concurrent execution several benchmarks are started simultaneously, and the resource waiting time impacts the runtime of the query suite.

36 https://aws.amazon.com/about-aws/whats-new/2018/04/support-for-spark-2-3-0-on-amazon-emr-release-5-13-0/
37 https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-5x.html#emr-5130-release
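A minimal sketch of this measurement approach for the Spark case (the query list, user count, and structure are placeholders, not the thesis' actual driver): each query is timed from just before submission until the results have been collected, and concurrent streams are simply started at the same time.

import org.apache.spark.sql.SparkSession
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object BenchmarkRunner {
  // Wall-clock time of one query: timestamp before submission,
  // timestamp after the results have been collected.
  def timed(spark: SparkSession, sql: String): Long = {
    val start = System.currentTimeMillis()
    spark.sql(sql).collect()
    System.currentTimeMillis() - start
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
    val suite = Seq("select count(*) from lineorder") // placeholder for the SSB query suite

    // Single-user run: the suite executed once, query after query.
    println(s"single user: ${suite.map(q => timed(spark, q)).sum} ms")

    // Concurrent run: several copies of the suite started simultaneously,
    // so resource waiting time is included in each suite's runtime.
    val users = 5
    val perUser = Future.sequence(
      (1 to users).map(_ => Future { suite.map(q => timed(spark, q)).sum })
    )
    println(s"$users users, per-user suite times: ${Await.result(perUser, Duration.Inf).mkString(", ")} ms")
  }
}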

4.3 Performance Tuning

For Hive the following additional settings were altered:

hive.llap.execution.mode=all
hive.prewarm.enabled=true
hive.vectorized.execution.enabled=true
hive.vectorized.execution.reduce.enabled=true
hive.vectorized.execution.reduce.groupby.enabled=true

This ensured that the containers were pre-warmed and that vectorization was enabled for LLAP. LLAP mode all was chosen instead of only in order for all queries to return results, even if not enough resources were available for LLAP to execute the query. In addition, force-local-scheduling was set to true, in order to ensure that all of the possible calls to HDFS were attempted locally, taking advantage of data locality.

ORC file configuration was:

"orc.compress"="SNAPPY",

"orc.compress.size"="262144",

"orc.create.index"="true",

"orc.stripe.size"="268435456",

"orc.row.index.stride"="3000")

The Parquet file configuration used the same parameter values for compression and index creation.
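As a sketch of how these properties could be attached to a table (the table name and column list are illustrative and shortened; only the TBLPROPERTIES mirror the configuration listed above, and whether the tables were created through Hive directly or through Spark is not shown here), the ORC table can be created via Spark SQL with Hive support:

import org.apache.spark.sql.SparkSession

object CreateOrcTable {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

    // Column list shortened for brevity; properties taken from the configuration above.
    spark.sql(
      """create table if not exists lineorder_orc (
        |  lo_orderkey bigint,
        |  lo_custkey bigint,
        |  lo_quantity int,
        |  lo_discount int,
        |  lo_extendedprice bigint,
        |  lo_revenue bigint
        |)
        |stored as orc
        |tblproperties (
        |  "orc.compress"="SNAPPY",
        |  "orc.compress.size"="262144",
        |  "orc.create.index"="true",
        |  "orc.stripe.size"="268435456",
        |  "orc.row.index.stride"="3000"
        |)""".stripMargin)

    spark.stop()
  }
}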

5 Results

The benchmark results are presented in this section, which is divided into three subsections. In the first subsection single user execution is considered, in the second subsection a file format comparison for each framework is presented, and in the third subsection the results for concurrent executions are presented, grouped by scale of data, by framework, and with a separate comparison of Spark and Presto.

5.1 Single User Execution

In this section the results for a single user execution grouped by file format and framework are presented. First, the text format results are presented, then ORC, then Parquet. In the final subsection the direct comparison between Spark on Parquet and Presto on ORC is presented.

5.1.1 Text format

Figure 16: Average execution times for text format, SF1

Figure 17: Average execution times for text format, SF10

Figure 18: Average execution times for text format, SF20

Figure 19: Average execution times for text format, SF75

Figures 16-19 show the charts for average execution times for text format data, scale factors 1, 10, 20, and 75.

Figure 16 shows the results for text for scale factor 1. On all of the queries Presto was the fastest, and Hive LLAP was second fastest. On queries Q1.2, Q1.3, Q2.3, Q3.3, and Q3.4 their execution times were comparable. Spark was the third fastest, being slower than Hive LLAP on most of the queries. Spark was faster only on Q3.1, and tied with Hive LLAP on Q1.1 and Q4.1. Hive Tez was the slowest on all of the queries, with by far the worst execution time on Q1.1.

Figure 17 shows the results for text for scale factor 10. Again, Presto was the fastest, with most of its execution times being twice as fast as the closest competitor. One major difference is that Spark became faster than Hive LLAP on most of the queries, with Q1.2, Q1.3, and Q2.1 being the only exceptions. Hive LLAP is faster than Hive Tez on every query except Q2.2, where the execution times are roughly the same. Tez is still the slowest. There is still a notable spike for Q1.1, where the difference between the Hive LLAP and Hive Tez execution times is the largest.

Figure 18 shows the results for text for scale factor 20. Presto maintains its status as the fastest, with only Hive LLAP almost matching its execution times for Q1.2 and Q1.3. Spark is the second fastest on some of the queries (Q2.1, Q4.1, Q4.2, Q4.3), but Hive LLAP matches its times for Q1.1 and Q2.2 - Q3.4, and is faster for Q1.2 and Q1.3. Another key difference is that the execution times for Hive Tez are much closer to Hive LLAP and Spark, Q1.1 being the only exception, where it is much slower.

Figure 19 shows the results for text for scale factor 75. Hive LLAP is the fastest framework for Q1.2, Q1.3, and Q4.3, narrowly beating Presto, which ties on Q3.4 and is faster on all other queries. Hive Tez is slower than or ties with Hive LLAP on most of the queries. The most striking difference is that Spark is the slowest framework on all of the queries, which was an unexpected result, taking into account the results of the previous runs.

Figure 20: Execution time growth, text format, Hive Tez

Figure 21: Execution time growth, text format, Hive LLAP

Figure 22: Execution time growth, text format, Spark

Figure 23: Execution time growth, text format, Presto

Figures 20-23 show the execution time growth for each framework on text format data. The data size growth corresponds to the scale factors, being 1, 10, 20, and 75.

Figure 20 shows the execution time growth for Hive Tez. The execution time growth is the slowest for Q1.2, Q1.3, and Q4.3. It grows steeply from SF1 to SF10, but stays stable over SF10, SF20, and SF75. The growth is the fastest for Q2.1, especially for SF75. The growth rate corresponds to the change of SF for all of the other queries.

Figure 21 shows the execution time growth for Hive LLAP. The growth pattern is similar to Hive Tez, but in addition to Q1.2, Q1.3, and Q4.3 the queries Q3.3 and Q3.4 show very slow execution time growth on SF10 - SF75. Again, Q2.1 shows the steepest incline in execution time, while the other queries grow in correspondence with the scale factor.

Figure 22 shows the execution time growth for Spark. In contrast to Hive, Spark shows very stable execution time growth for all of the queries. Only Q2.2 shows a slightly sharper increase in execution time from SF20 to SF75. However, the absolute values for Spark are larger for SF75.

Figure 23 shows the execution time growth for Presto. Similar to Spark, Presto shows very stable growth in execution time, closely corresponding to the growth in scale. There is no noticeable overhead and the absolute values of execution times are the lowest among all of the frameworks.

5.1.2 ORC

Figure 24: Average execution times for ORC format, SF1

Figure 25: Average execution times for ORC format, SF10

Figure 26: Average execution times for ORC format, SF20

Figure 27: Average execution times for ORC format, SF75

Figures 24-27 show the charts for average execution times for ORC format data, scale factors 1, 10, 20, and 75.

Figure 24 shows the results for ORC for scale factor 1. Similar to the text results, Presto is the fastest by far on scale factor 1. Spark is the second fastest on almost all other queries, except for Q1.1 and Q2.1, where Hive LLAP was faster. Hive Tez was the slowest on all of the queries.

Figure 25 shows the results for ORC for scale factor 10. The results for SF 10 differ substantially, with Spark and Presto being very fast, Presto being only slightly faster. A striking change from the text format is that the Hive LLAP execution times are very similar to Tez, being only slightly faster on most queries. Hive Tez is still the slowest, with a noticeable slow start on Q1.1.

Figure 26 shows the results for ORC for scale factor 20. The execution time growth is similar to the previous chart - Presto being the fastest or tying with Spark. The difference is that Hive LLAP is slightly slower on some of the executions than Hive Tez. However, Hive Tez is twice as slow on Q1.1.

Figure 27 shows the results for ORC for scale factor 75. There are some interesting differences to note compared to scale factor 20. Spark is the fastest on Q2.1, Q2.2, Q2.3, Q4.2, and Q4.3. Presto is still faster on Q1.1-Q1.3 and Q3.1-Q4.1. Execution times for Hive Tez and Hive LLAP are similar, with LLAP being slightly faster on all of the queries.

Figure 28: Execution time growth, ORC format, Hive Tez

Figure 29: Execution time growth, ORC format, Hive LLAP

Figure 30: Execution time growth, ORC format, Spark

Figure 31: Execution time growth, ORC format, Presto

Figures 28-31 show the execution time growth for each framework on ORC format data.

Figure 28 shows the execution time growth for Hive Tez on ORC. The growth pattern is very similar to the text format, but the rate is more stable. The execution times follow a very similar pattern on SF10, SF20, and SF75, without sudden spikes.

Figure 29 shows the execution time growth for Hive LLAP on ORC. Most of the queries have a very stable rate of growth, with Q1.2 and Q1.3 having a small rate of growth, and Q2.1, Q3.1, Q4.1, and Q4.2 taking a much longer time, growing at a rate corresponding to the data volume increase.

Figure 30 shows the execution time growth for Spark on ORC. There are a few interesting features in this chart. The growth is very slow over SF1-SF10-SF20, much slower than the growth rate of the data volume. This trend continues for Q1.1-Q2.3 and Q4.2, Q4.3 to SF75, showing a very small increase. In contrast, Q3.1-Q4.1 show a very sharp surge in execution time for SF75.

Figure 31 shows the execution time growth for Presto on ORC. Presto shows the most consistent growth pattern with respect to data volume growth. The differences in individual query times amplify, but the pattern is the same for all scales.

5.1.3 Parquet

Figure 32: Average execution times for Parquet, SF1

Figure 33: Average execution times for Parquet, SF10

Figure 34: Average execution times for Parquet, SF20

Figure 35: Average execution times for Parquet, SF75

Figures 32-35 show the charts for average execution times for Parquet data, scale factors 1, 10, 20, and 75.

Figure 32 shows the results for Parquet for scale factor 1. There are several interesting features in this chart. First, Spark and Presto show considerably better execution times compared to Hive. On Q1.1 - Q2.3 Presto is faster than or tied with Spark, with the biggest difference being on Q1.1. However, on Q3.1 - Q4.3 the situation is reversed, with Spark being the fastest.

Figure 33 shows the results for Parquet for scale factor 10. The trend from SF 1 continues, with Spark and Presto being the definite leaders, but now Presto starts to get slower. On Q1.1 - Q1.3 it still maintains the lead, but on all other queries Spark is definitely faster.

Figure 34 shows the results for Parquet for scale factor 20. The same pattern from the previous chart continues, as Spark execution times grow much slower with the increased data volume. Only Q1.1 shows Presto being faster than Spark.

Figure 35 shows the results for Parquet for scale factor 75. Starting from this chart Spark becomes the leader on all queries. In addition, Presto execution times are close to Hive LLAP on Q2.2 and Q2.3, for the first time in the Parquet experiments.

Figure 36: Execution time growth, Parquet, Hive Tez

Figure 37: Execution time growth, Parquet, Hive LLAP

Figure 38: Execution time growth, Parquet, Spark

Figure 39: Execution time growth, Parquet, Presto

Figures 36-39 show the execution time growth for each framework on Parquet data. The data size growth corresponds to the scale factors, being 1, 10, 20, and 75.

The charts for Hive and Presto show a similar picture of growth that corresponds with the data volume growth. Spark, however, shows a radically different picture - its execution time grows much slower, showing only a slight increase with the data volume growth.

5.1.4 Spark Parquet and Presto ORC

Presto shows the best time on ORC and Spark shows the best time on Parquet, and in this section they are directly compared side by side.

Figure 40: Spark Parquet and Presto ORC average execution times for SF1

Figure 41: Spark Parquet and Presto ORC average execution times for SF10

Figure 42: Spark Parquet and Presto ORC average execution times for SF20

Figure 43: Spark Parquet and Presto ORC average execution times for SF75

Figures 40 - 43 show the average execution times for Spark on Parquet data and Presto on ORC data. The data size growth corresponds to the scale factors, being 1, 10, 20, and 75.

Figure 40 shows the comparison on SF1 data. Spark is slower on all of the queries, and it is evident that Q1.1 takes a lot more time for it to process. Q2.1 also takes more time than the others.

Figure 41 shows the comparison on SF10 data. The situation is quite different, as Spark is now faster on Q2.1 - Q4.3, with only the first three queries being faster on Presto. Q1.1 still shows the overhead that the first execution on a container introduces in Spark, while Presto does not display that behaviour.

Figure 42 shows the comparison on SF20 data. Qualitatively the situation does not change much, but the time difference on Q2.1 - Q4.3 grows substantially.

Figure 43 shows the comparison on SF75 data. The difference becomes apparent, as Presto becomes much slower than Spark on Q1.2 - Q4.3.

5.2 File Format Comparison

Figure 44: Average execution times for Hive Tez, SF1

Figure 45: Average execution times for Hive Tez, SF10

Figure 46: Average execution times for Hive Tez, SF20

Figure 47: Average execution times for Hive Tez, SF75

Figures 44-47 show the average execution times for Hive Tez on ORC, Parquet, and text format data, grouped by scale factor. All of the charts show an improvement in execution time.

Figure 44 shows the execution times for Hive Tez on ORC, Parquet, and text files for SF 1. For text and ORC the biggest difference in execution time is on Q1.1, Q1.3, and Q4.3, where the time for ORC is roughly two times less than for text. An interesting feature of this chart is that on Q3.2 and Q3.3 the ORC execution time is larger than for text. On most of the other queries the execution times are similar. Most of the queries are faster on Parquet than on text and ORC, with the exception of Q1.1 and Q4.1, where ORC is faster.

Figure 45 shows the execution times for Hive Tez on ORC, Parquet, and text files for SF 10. For text and ORC, in contrast to the previous chart, the ORC execution time is smaller on every query. Most of the queries show an improvement of 25-50%. In contrast to the previous chart, Parquet is slower than ORC on Q1.1 - Q1.3 and Q4.3. On the other queries the difference is similar in comparison to ORC, and greater in comparison to text.

Figure 46 shows the execution times for Hive Tez on ORC, Parquet, and text files for SF 20. For text and ORC the picture is similar to SF1: almost all queries are faster for ORC, with the exception of Q3.1. The difference in execution time is smaller than in SF10, but still present, especially on Q1.1 - Q1.3 and Q3.2 - Q4.2. The trend of Parquet execution times becoming slower than ORC continues, with Q3.1, Q4.1, and Q4.2 being slower on Parquet than ORC. The difference between Parquet and text closes, with Q3.1 being the fastest on text.

Figure 47 shows the execution times for Hive Tez on ORC, Parquet, and text files for SF 75. In contrast to the two previous charts, for text and ORC, ORC is slower than text on Q3.2-Q4.2. However, the trend of Q1.1-Q1.3 being substantially faster continues. On almost every other query the execution times are comparable. In terms of Parquet the situation is somewhat reversed: only Q1.1 - Q1.3 and Q3.1 are slower on Parquet, while the other queries show Parquet to be faster than ORC, in some cases quite substantially. For instance, Q2.1 - Q2.3 and Q3.3 - Q3.4 are 50% faster on Parquet.

Figure 48: Average execution times for Hive LLAP, SF1

Figure 49: Average execution times for Hive LLAP, SF10

Figure 50: Average execution times for Hive LLAP, SF20

Figure 51: Average execution times for Hive LLAP, SF75

Figures 48-51 show the average execution times for Hive LLAP on ORC, Parquet, and text format data, grouped by scale factor.

Figure 48 shows the execution times for Hive LLAP on ORC, Parquet, and text files for scale factor 1. There are a few interesting features about it: similar to Tez, LLAP on ORC shows a marked improvement on Q1.1, especially compared to Parquet. However, on queries Q2.1 - Q2.3 and Q3.2 - Q3.4 ORC is slower than text, and on Q2.1 - Q2.3, Q3.2 - Q3.4, and Q4.2 it is slower than Parquet.

Figure 49 shows the execution times for Hive LLAP on ORC, Parquet, and text files for scale factor 10. In contrast to the previous chart the execution times for ORC are lower on all of the queries. The difference for many queries is around 50%. The only queries having similar execution times are Q2.1 and Q3.1. In terms of Parquet there is an improvement, as on queries Q2.1 - Q4.2 it is tied with or faster than ORC and text. On Q1.1 - Q1.3 and Q4.3 ORC is still the fastest format.

Figure 50 shows the execution times for Hive LLAP on ORC, Parquet, and text files for scale factor 20. The situation is similar to Tez, as the queries are still faster for ORC than for text, with the exception of Q3.1. The trend with Parquet is similar: on Q1.1 - Q1.3 ORC is faster, while on all the other queries the difference is smaller, but Parquet is still slightly faster.

Figure 51 shows the execution times for Hive LLAP on ORC, Parquet, and text files for scale factor 75. The pattern of execution times closely resembles the same scale factor for Tez, figure 47. Q1.1 - Q1.3 are substantially faster on ORC, while Q3.2 - Q4.3 are faster on Parquet.

Figure 52: Average execution times for Spark, SF1

Figure 53: Average execution times for Spark, SF10

Figure 54: Average execution times for Spark, SF20

Figure 55: Average execution times for Spark, SF75

Figures 52-55 show the average execution times for Spark on ORC, Parquet, and text format data, grouped by scale factor. There is a major difference from the same charts for Hive, as Parquet is clearly faster on every query.

Figure 52 shows the execution times for Spark on ORC, Parquet, and text files for scale factor 1. Parquet is faster on all of the queries; however, the time difference is smaller on Q1.1 - Q1.3 and Q3.3 - Q3.4, compared to the other queries, where the times for Parquet were two or more times smaller than for ORC.

Figure 53 shows the execution times for Spark on ORC, Parquet, and text files for scale factor 10. Similar to scale factor 1, the execution times for Parquet are smaller than for text and ORC, but the difference is in the scale of it - here most of the ORC times are 70-80% faster than for text, and these times are further improved on Parquet.

Figure 54 shows the execution times for Spark on ORC, Parquet, and text files for scale factor 20. The same trend continues with scale factor 20, with most of the ORC times being 80-90% faster than text on most of the queries, and Parquet being even faster, with a similar improvement over ORC.

Figure 55 shows the execution times for Spark on ORC, Parquet, and text files for scale factor 75. This chart has some interesting features. While some of the query times (Q1.1 - Q2.3 and Q4.2, Q4.3) still show a major difference in runtime on the scale of the previous charts, queries Q3.1 - Q4.1 show much less of a difference, with ORC times around a third of the times for text. The trend for Parquet shows a remarkable growth pattern, as it is substantially smaller compared to ORC and text.

Figure 56: Average execution times for Presto, SF1

Figure 57: Average execution times for Presto, SF10

Figure 58: Average execution times for Presto, SF20

Figure 59: Average execution times for Presto, SF75

Figures 56-59 show the average execution times for Presto on ORC, Parquet, and text format data, grouped by scale factor. The pattern of execution times is even more consistent than for Spark in terms of ORC and text. For all of the queries ORC is faster, being more than twice as fast as text and slightly faster than Parquet. This trend does not change for SF1, SF10, SF20, and SF75, showing a remarkably consistent picture.

5.3 Concurrent Execution

In this section the results for concurrent execution are presented. The first subsection groups them by scale factor of data, the second subsection groups them by framework, and the final subsection has a direct comparison of Spark on Parquet and Presto on ORC.

5.3.1 Grouped by Scale Factor

Figure 60: Average execution times for 1 user, SF1

Figure 61: Average execution times for 2 users, SF1

Figure 62: Average execution times for 5 users, SF1

Figures 60-62 show the average execution times for SF1 data in ORC format grouped by framework for 1, 2, and 5 users.

Figure 60 shows the execution time for SF1 ORC data with a single user for all frameworks. Presto is the fastest by far on scale factor 1. Spark is the second fastest on almost all other queries, except for Q1.1 and Q2.1, where Hive LLAP was faster. Hive Tez was the slowest on all of the queries.

Figure 61 shows the execution time for SF1 ORC data with two concurrent users for all frameworks. Presto maintains the status of the fastest framework with ease, being two or more times faster than Spark. Spark easily holds the second place, with Hive LLAP and Hive Tez being much slower. This is different from the execution with a single user, where Hive LLAP was closer to Spark on several queries and outperformed it on two.

Figure 62 shows the execution time for SF1 ORC data with five concurrent users for all frameworks. Once again, Presto is the indisputable leader, with execution times more than twice as fast as Spark. The difference in execution times between Spark and Hive LLAP shortened, with some of the queries having almost the same results (Q1.2, Q1.3, Q4.1, Q4.3). Hive Tez is the slowest. A notable feature is the overhead for both Hive Tez and LLAP on Q1.1, which takes more than three times the time that Spark requires for it.

Figure 63: Average execution times for 1 user, SF10

Figure 64: Average execution times for 2 users, SF10

Figure 65: Average execution times for 5 users, SF10

Figures 63-65 show the average execution times for SF10 data in ORC format grouped by framework for 1, 2, and 5 users.

Figure 63 shows the execution time for SF10 ORC data with a single user for all frameworks. The results for SF 10 differ substantially from SF 1 (figure 60), with Spark and Presto being very fast, Presto being only slightly faster. Hive Tez is still the slowest, with a noticeable slow start on Q1.1.

Figure 64 shows the execution time for SF10 ORC data with two concurrent users for all frameworks. The situation has several changes from the single concurrent user: the difference between Spark and Presto became even smaller, with Spark being slightly faster on several queries. In addition, Hive LLAP and Hive Tez have very similar execution times on most of the queries.

Figure 65 shows the execution time for SF10 ORC data with five concurrent users for all frameworks. There are a few key differences compared to the previous chart. First, Spark is two times faster than Presto on most of the queries (Q2.1 - Q4.3), with Presto maintaining the lead on Q1.1 - Q1.3. Second, Hive LLAP execution times are definitely smaller than Hive Tez on every query.

Figure 66: Average execution times for 1 user, SF20

Figure 67: Average execution times for 2 users, SF20

Figure 68: Average execution times for 5 users, SF20

Figures 66-68 show the average execution times for SF20 data in ORC format grouped by framework for 1, 2, and 5 users.

Figure 66 shows the execution time for SF20 ORC data with a single user for all frameworks. Presto is the fastest or ties with Spark. The difference from scale factor 10 is that Hive LLAP is slightly slower on some of the executions than Hive Tez. However, Hive Tez is two times slower on Q1.1.

Figure 67 shows the execution time for SF20 ORC data with two concurrent users for all frameworks. This chart has several key differences. Presto becomes slower than Spark on Q2.1 - Q4.3. Hive Tez execution times grow very fast compared to Hive LLAP, especially on Q3.2 - Q4.3.

Figure 68 shows the execution time for SF20 ORC data with five concurrent users for all frameworks. The results for five users reinforce the results for two users. Spark has comparable (Q1.1 - Q1.3) or faster (Q2.1 - Q4.3) execution times than Presto on all of the queries. Hive LLAP outperforms Hive Tez on every query, but the difference is the largest on Q3.1 - Q4.3.

5.3.2 Grouped by Framework

Figure 69: Average execution times for Hive Tez, SF1

Figure 70: Average execution times for Hive Tez, SF10

Figure 71: Average execution times for Hive Tez, SF20

Figures 69-71 show the average execution times for Hive Tez on ORC format data grouped by scale factor.

Figure 69 shows the execution times for 1, 2, and 5 users for Hive Tez on ORC data at scale factor 1. A notable feature of this chart is the growth in execution times when going from 2 users to 5. Some queries took more than five times as long as for two concurrent users.

Figure 70 shows the execution times for 1, 2, and 5 users for Hive Tez on ORC data at scale factor 10. There are several interesting features in this chart compared to scale factor 1. The growth in execution time when going from 1 to 2 users closely resembles the first chart; however, the situation with 5 users is very different. For some of the queries the growth was much slower, e.g. Q3.1 - Q4.3 show slow growth. For other queries the execution time growth is more similar to scale factor 1, e.g. Q1.1 and Q2.1 - Q2.3. Q1.2 and Q1.3 seem to grow much slower than the others.

Figure 71 shows the execution times for 1, 2, and 5 users for Hive Tez on ORC data at scale factor 20. The pattern of growth becomes more stable compared to scale factor 10. The only outlier is Q3.1, which seems to require significantly more time to execute for 5 users, while most other queries have a more predictable execution time growth.

Figure 72: Average execution times for Hive LLAP, SF1

Figure 73: Average execution times for Hive LLAP, SF10

Figure 74: Average execution times for Hive LLAP, SF20

Figures 72-74 show the average execution times for Hive LLAP on ORC format data grouped by scale factor.

Figure 72 shows the execution times for 1, 2, and 5 users for Hive LLAP on ORC data at scale factor 1. Most of the query execution times show a much more stable growth pattern compared to Hive Tez, with the exception of Q1.1, which seems to have a lot of overhead from concurrent executions.

Figure 73 shows the execution times for 1, 2, and 5 users for Hive LLAP on ORC data at scale factor 10. There are differences from the previous chart. The growth in execution time is stable, including for Q1.1. Queries Q1.2 and Q1.3 have a much slower growth compared to the others. The other queries show execution time growth corresponding to the change in the number of concurrent users.

Figure 74 shows the execution times for 1, 2, and 5 users for Hive LLAP on ORC data at scale factor 20. The chart for scale factor 20 has the same growth patterns as the one for scale factor 10.

Figure 75: Average execution times for Spark, SF1

Figure 76: Average execution times for Spark, SF10

Figure 77: Average execution times for Spark, SF20

Figures 75-77 show the average execution times for Spark on ORC format data grouped by scale factor.

Figure 75 shows the execution times for 1, 2, and 5 users for Spark on ORC data at scale factor 1. Spark shows a very different rate of growth compared to Hive. For the transition from a single user to two concurrent users Spark has almost no increase in runtime. For the increase to five concurrent users the rate of growth is smaller than for Hive on most of the queries.

Figure 76 shows the execution times for 1, 2, and 5 users for Spark on ORC data at scale factor 10. The trend is even more visible for Spark at scale factor 10. It shows only a slight increase in execution time going from a single user to two and to five users on scale factor 10 data.

Figure 77 shows the execution times for 1, 2, and 5 users for Spark on ORC data at scale factor 20. The execution times for scale factor 20 show even more impressive results, with almost no difference between the three executions.

Figure 78: Average execution times for Presto, SF1

Figure 79: Average execution times for Presto, SF10

Figure 80: Average execution times for Presto, SF20

Figures 78-80 show the average execution times for Presto on ORC format data grouped by scale factor.

Figure 78 shows the execution times for 1, 2, and 5 users for Presto on ORC data at scale factor 1. On scale factor 1 Presto shows a growth pattern similar to Hive LLAP, with most queries having steady growth corresponding with the increase in concurrent users. The exception is Q1.1, which shows an overhead for concurrent execution.

Figure 79 shows the execution times for 1, 2, and 5 users for Presto on ORC data at scale factor 10. The execution time growth gets worse with the increase in volume. However, execution time grows slower than the number of concurrent users.

Figure 80 shows the execution times for 1, 2, and 5 users for Presto on ORC data at SF 20. Scale factor 20 reinforces the pattern established by scale factor 10, but the execution time growth gets slightly worse for 5 users.

5.3.3 Presto ORC and Spark Parquet on concurrent users

Figure 81: Average execution times for Spark for concurrent users, SF10, Parquet

Figure 82: Average execution times for Presto for concurrent users, SF10, ORC

Figure 83: Total execution time growth for Spark and Presto for concurrent users, SF10

Figures 81 and 82 show the average execution times for Spark on Parquet and Presto on ORC for concurrent users at scale factor 10 data volume. There is a very different growth pattern for the two frameworks. Spark shows a very slow growth, while Presto shows a similar growth for up to 5 users, but has much slower execution for 10 and 25 users. Figure 83 shows the total execution time growth for Spark and Presto. It is evident that Spark shows much slower growth, while Presto has a dramatic increase in execution time.

6 Conclusions

In this section the conclusions and future work are presented. First the conclusions for Single User Execution are presented, in which the performance of each framework with only a single user is examined. Then the File Format Comparison results are outlined, in which for every framework the performance on different file formats is examined. Finally, in Concurrent Execution the performance of the frameworks with multiple concurrent users is presented.

6.1 Single User Execution

Presto is the fastest framework for queries with a single concurrent user on ORC in most conditions, while Spark is the fastest on Parquet for a single user. Spark and Hive LLAP are the second fastest on ORC in different situations. On text data Hive LLAP maintained a lead over Spark, especially on scale factor 75 (figure 19). On ORC data the picture is quite different. Presto was definitely the fastest on SF 1 and SF 75, but Spark competed with it closely on scale factors 10 and 20. Hive LLAP's performance gain was smaller than for Spark and Presto, making it the third fastest on ORC. On Parquet the situation is substantially different, with Spark being the absolute leader. The difference is especially visible in figures 42 and 43. Presto's execution time grows much faster than Spark's, and at SF75 this becomes very apparent.

The execution time growth patterns on both text and ORC show a similar picture. For Hive Tez and Hive LLAP the changes in query execution times vary more (figures 20, 21, 28, 29) than for Presto and Spark (figures 22, 23, 30, 31). Both of the latter show a stable execution time growth pattern that is consistent with the data volume growth with increasing scale factor. Some queries circumvent this pattern, having much slower growth in execution time than others, especially in the ORC format. On Hive and Presto, Parquet (figures 36, 37, 39) shows a similar growth pattern to ORC (figures 28, 29, 31), while Spark shows a slow growth compared to the other frameworks (figure 38), leading to the conclusion that Spark will perform better with growing data volume, as long as the cluster RAM can handle it. Overall, the results show that single user execution was most performant with Spark on Parquet for SF10 - SF75 (figures 41, 42, and 43), and with Presto on ORC for SF1 (figure 40).

6.2 File Format Comparison

In terms of ORC and text execution time comparisons, Hive showed unexpected results. Hive Tez performed faster on ORC for scale factors 1, 10, and 20, but for scale factor 75 it showed a better time on some of the queries on text files (figures 44 - 47). Hive LLAP showed better times for ORC on scale factors 10 and 20 (figures 49 and 50), but had mixed results for SF 1 and 75 (figures 48 and 51). Spark and Presto performed faster on ORC, with a substantial reduction of execution time, especially Spark on SF 10 and 20 (figures 52 - 59).

For Parquet the trends are similar to those for Hive on ORC, where the times for Parquet and ORC (figures 44 - 51) are similar in most cases. For Spark the situation is very different, as Parquet shows virtually no growth compared to text and ORC as the data volume grows (figures 52 - 55). Presto shows similar growth on ORC and Parquet, being very consistent (figures 56 - 59).

Overall, the decision on which format to pick is very dependent on the choice of framework. On ORC Presto showed the best execution time, while Spark showed the second best time, but its growth pattern with scale did not look very good. On Parquet Spark easily outperformed the other frameworks and the execution time growth was much slower, leading to better performance on large data volumes. In addition, the data volume after compression can be taken into account, as shown in tables 1 - 3. At scale factor 75 the difference is already more than 40%, with ORC providing the smaller data volume. With strict constraints on data volume storage, it is possible that the compression factor will be a major part of the decision. To conclude, the experiments showed that Spark on Parquet is the most performant choice, but Presto on ORC can be a solid option under some conditions.

6.3 Concurrent Execution

In concurrent execution the fastest framework at extremely small scale was Presto (scale factor 1), while Spark outperformed it on scale factors 10, 20, and 75 on ORC and Parquet. In addition to being the fastest, Spark shows a slow growth rate with the increase in concurrent users, with only a slight increase in processing time and very small overhead. Presto was faster on small-scale data, but the overhead from concurrent users made it slower with growth in data volume and concurrency. Spark shows great isolation between users, as is evident in the execution time growth charts, with a very slight execution time increase for a growing number of concurrent users. The difference is especially evident when comparing Presto on ORC and Spark on Parquet side by side. Total execution time greatly increased with more than 5 users for Presto, while Spark maintains a steady execution time even with 25 concurrent users. During the runs it seemed that Presto encountered more resource management problems, with the times for queries varying significantly. Spark, on the other hand, seemed to execute queries in roughly the same time, without using all of the cluster resources for every query.

6.4 Future Work

There are several possible directions for future work. First, it is possible to perform an examination of the resource usage of the queries in order to determine the possible bottlenecks that the different frameworks encounter. Second, file formats can be explored in more detail, with the possible inclusion of the RCFile38 or Avro39 formats in the comparison. Lastly, another avenue for future work is to assess the additional overhead from resource provisioning by YARN and include it in the results, which could impact the results on concurrency.

38 https://cwiki.apache.org/confluence/display/Hive/RCFile
39 https://avro.apache.org/

References

[1] Big Data Statistics and Facts for 2017 report by Waterford Technologies;. Accessed: 2018-05-12. https://www.waterfordtechnologies.com/big-data-interesting-facts/.

[2] DB-Engines Ranking;. Accessed: 2018-05-12. https://db-engines.com/en/ranking.

[3] The Business Intelligence on Hadoop Benchmark Q4 2016 report by atScale;. Accessed: 2017-12-24. http://info.atscale.com/atscale-business-intelligence-on-hadoop-benchmark.

[4] Xue R. SQL Engines for Big Data Analytics [G2 Pro gradu, diplomityö]; 2015. Available from: http://urn.fi/URN:NBN:fi:aalto-201512165719.

[5] Li X, Zhou W. Performance Comparison of Hive, Impala and Spark SQL. In: Proceedings of the 2015 7th International Conference on Intelligent Human-Machine Systems and Cybernetics - Volume 01. IHMSC '15. Washington, DC, USA: IEEE Computer Society; 2015. p. 418–423. Available from: http://dx.doi.org/10.1109/IHMSC.2015.95.

[6] Li M, Tan J, Wang Y, Zhang L, Salapura V. SparkBench: A Comprehensive Benchmarking Suite for in Memory Data Analytic Platform Spark. In: Proceedings of the 12th ACM International Conference on Computing Frontiers. CF '15. New York, NY, USA: ACM; 2015. p. 53:1–53:8. Available from: http://doi.acm.org/10.1145/2742854.2747283.

[7] Ivanov T, Beer M. Evaluating Hive and Spark SQL with BigBench. CoRR. 2015;abs/1512.08417. Available from: http://arxiv.org/abs/1512.08417.

[8] Ousterhout K, Rasti R, Ratnasamy S, Shenker S, Chun BG. Making Sense of Performance in Data Analytics Frameworks. In: 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI 15). Oakland, CA: USENIX Association; 2015. p. 293–307. Available from: https://www.usenix.org/conference/nsdi15/technical-sessions/presentation/ousterhout.

[9] Hakansson A. Portal of research methods and methodologies for research projects and degree projects. In: Proceedings of the International Conference on Frontiers in Education: Computer Science and Computer Engineering (FECS). The Steering Committee of The World Congress in Computer Science, Computer Engineering and Applied Computing (WorldComp); 2013. p. 1.

[10] Apache Hadoop project;. Accessed: 2017-11-24. http://hadoop.apache.org/.

[11] Dorband J. Commodity Cluster Computing for Remote Sensing Applications using Red Hat LINUX; 2003.

[12] Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters. Communications of the ACM. 2008;51(1):107–113.

[13] Shvachko K, Kuang H, Radia S, Chansler R. The Hadoop Distributed File System. In: Mass storage systems and technologies (MSST), 2010 IEEE 26th symposium on. IEEE; 2010. p. 1–10.

[14] Ghemawat S, Gobioff H, Leung ST. The Google File System. vol. 37. ACM; 2003.

[15] Vavilapalli VK, Murthy AC, Douglas C, Agarwal S, Konar M, Evans R, et al. Apache Hadoop YARN: Yet another resource negotiator. In: Proceedings of the 4th annual Symposium on Cloud Computing. ACM; 2013. p. 5.

[16] Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, et al. Hive: a warehousing solution over a map-reduce framework. Proceedings of the VLDB Endowment. 2009;2(2):1626–1629.

[17] Huai Y, Chauhan A, Gates A, Hagleitner G, Hanson EN, O'Malley O, et al. Major Technical Advancements in Apache Hive. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. SIGMOD '14. New York, NY, USA: ACM; 2014. p. 1235–1246. Available from: http://doi.acm.org/10.1145/2588555.2595630.

[18] Saha B, Shah H, Seth S, Vijayaraghavan G, Murthy A, Curino C. Apache Tez: A unifying framework for modeling and building data processing applications. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM; 2015. p. 1357–1369.

[19] Song H, Dharmapurikar S, Turner J, Lockwood J. Fast hash table lookup using extended bloom filter: an aid to network processing. ACM SIGCOMM Computer Communication Review. 2005;35(4):181–192.

[20] Shainman M. The Power of Presto Open Source for the Enterprise. Teradata; 2016.

[21] Zaharia M, Xin RS, Wendell P, Das T, Armbrust M, Dave A, et al. Apache Spark: A Unified Engine for Big Data Processing. Commun ACM. 2016 Oct;59(11):56–65. Available from: http://doi.acm.org/10.1145/2934664.

[22] Apache Spark Survey 2016;. Accessed: 2017-12-12. http://cdn2.hubspot.net/hubfs/438089/DataBricks_Surveys_-_Content/2016_Spark_Survey/2016_Spark_Infographic.pdf?t=1520448521980.

[23] Kleppmann M. Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. O'Reilly Media; 2017. Available from: https://books.google.se/books?id=BM7woQEACAAJ.

[24] O'Neil PE, O'Neil EJ, Chen X. The Star Schema Benchmark (SSB); 2009.
