Machine learning with big data in the Hadoop Ecosystem for Scientific Computing

Suhua Wei, Yong Yu

Abstract

With an unprecedented and exponentially growing amount of data available to research communities across fields of scientific inquiry, selecting machine learning tools, data processing architectures, and application software for research purposes has become a great challenge. This proposal aims to help scientific researchers without a big data background understand the big data architecture for machine learning using Hadoop open-source tools. Through a discussion of the advantages and drawbacks of tool choices, this proposal aims to give scientific researchers a roadmap for making smart decisions. In addition, it may serve as a starting point for building scientific toolkits tuned for specific purposes with regard to machine learning applications of the Hadoop ecosystem.

1. Introduction

1.1 Hadoop ecosystem

Hadoop is an open-source project for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers [1]. Hadoop was created by Doug Cutting and named after his son's toy elephant. The framework is the core of a rapidly growing ecosystem in which a number of providers are building platforms based on Hadoop. The following figure shows the timeline of the Hadoop ecosystem's evolution.


Fig 1. Timeline of major events in Hadoop. Courtesy of http://www.cs.clemson.edu/sserg/iwseco/2013/

Around 2004, two formative papers were published by Google authors [2][3]. The former defined the Google File System, and the latter defined the architecture of an innovative data-processing approach called MapReduce. Hadoop was initially housed in the Apache Nutch project, but split off to become an independent Apache project in 2006. In January 2010, Google was granted a patent covering the MapReduce algorithm. Since then, the use of Hadoop has grown rapidly.

As time went on, the initial Hadoop project matured and spun off independent projects that provide continuously improving products: Avro, HBase, Hive, Pig, Flume, Sqoop, Oozie, HCatalog, ZooKeeper, etc. [4]. Of course, the ongoing Hadoop projects and products are not limited to those mentioned here. In this survey paper, we do not attempt to introduce those products one by one. Instead, we aim to give beginners a systematic introduction to the Hadoop ecosystem and its products under the framework of data storage, data processing, data access, and data management, as shown in Fig 2. For each category, we give a brief introduction to one or two representative products, exploring both their advantages and disadvantages.


Fig 2. Hadoop Ecosystem. Courtesy of http://revieweasyhomemadecookies.com/what-is-hadoop-ecosystem/

1.2 Machine learning

Machine learning focuses on making predictions through mathematical optimization over large amounts of data. The three most common categories of machine learning algorithms are: supervised learning, in which the algorithm produces a desired output, usually numerical or labeled; unsupervised learning, which discovers hidden patterns in the data; and reinforcement learning, which provides feedback through rewards and punishments toward achieving certain goals. Other categorizations, based on the output of a machine learning system, include classification, regression, clustering, density estimation, and dimensionality reduction.

1.3 Applications of machine learning and big data in scientific research

Computational modeling has been widely used as a powerful tool in various scientific research communities: unveiling chemical processes, such as a catalyst's purification of exhaust fumes; pushing the limits of human knowledge of the universe's expansion and dark energy with high-energy particle accelerators and data generated by the Hubble Space Telescope; sequencing DNA with high-throughput sequencers; and processing images collected by Earth-observatory satellites. Each such scientific instrument, as well as a host of others, is critically dependent on computing for sensor control, data processing, international collaboration, and access. As with successive generations of other large-scale scientific instruments, each new generation of advanced computing brings new capabilities, along with new technical design challenges [5].

2. Problem description

Data-generation capabilities in most science domains are growing more rapidly than computing capabilities, causing these domains to become data-intensive. High-end data analytics (big data) and high-end computing (exascale) are both essential elements of integrated computing research and development [5]. Research and development of next-generation algorithms, software, and applications is as crucial as investment in devices and hardware. Therefore, fostering the development of a technology ecosystem of research-oriented computing and analysis tools/products is also important. This proposal attempts to disclose potential solutions to the technical challenges in computing-research areas regarding machine learning applications at the middleware and management level, as well as at the application level, based on the Hadoop ecosystem.

3. Survey on the related work

3.1 Hadoop data analytics ecosystem

3.1.1 Data Storage

The representative Hadoop data storage projects are the Hadoop Distributed File System (HDFS) and HBase. Yahoo! has developed and contributed 80% of the core of Hadoop (HDFS and MapReduce). HDFS is designed to store very large data sets reliably and to stream those data sets at high bandwidth to user applications. An important characteristic of Hadoop is the partitioning of data and computation across many (thousands of) hosts, executing application computations in parallel close to their data. A Hadoop cluster scales computation capacity, storage capacity, and I/O bandwidth by simply adding commodity servers. HDFS stores file system metadata and application data separately [6].

The major components of the HDFS architecture are the NameNode, DataNodes, and clients. The HDFS namespace is a hierarchy of files and directories, represented on the NameNode by inodes. A file's contents are split into large blocks, and each block is independently replicated at multiple DataNodes. The NameNode maintains the namespace tree and the mapping of file blocks to DataNodes.

Each block replica on a DataNode is represented by two files in the local host's native file system: the first contains the data itself, and the second contains the block's metadata, including checksums for the block data and the block's generation stamp.

The HDFS client is a code library that exports the HDFS file system interface. HDFS supports operations to read, write, and delete files, and operations to create and delete directories. Unlike conventional file systems, HDFS provides an API that exposes the locations of file blocks. This allows applications such as the MapReduce framework to schedule a task where the data are located [6].
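As an illustration, the following minimal sketch (the file path is hypothetical) uses Hadoop's Java FileSystem API to write a file and then query the block locations that HDFS exposes to applications:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockDemo {
    public static void main(String[] args) throws Exception {
        // Reads fs.defaultFS and related settings from core-site.xml.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path; any writable HDFS path works.
        Path file = new Path("/user/demo/sample.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("hello hdfs\n");
        }

        // The block-location API that lets frameworks like MapReduce
        // schedule tasks close to the data.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.println(java.util.Arrays.toString(b.getHosts()));
        }
    }
}
```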

HBase was originally developed at Powerset, now a department at Microsoft [4]. HBase is an open-source distributed database modeled after Google's Bigtable and written in Java. It is a column-oriented database management system that runs on top of HDFS and is well suited to sparse data sets. It does not support a structured query language like SQL, so it is not a relational database. An HBase system contains tables; each table has rows and columns, with one element defined as the primary (row) key, and all accesses to an HBase table must use this key. Hadoop is mainly meant for batch processing, in which data is accessed only in a sequential manner, whereas HBase is used for quick random access to huge data sets.
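A brief sketch of this row-key-centric access pattern using the HBase Java client; the table name, column family, and row key below are hypothetical:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRowKeyDemo {
    public static void main(String[] args) throws Exception {
        try (Connection conn =
                 ConnectionFactory.createConnection(HBaseConfiguration.create());
             // Hypothetical table with a column family "obs".
             Table table = conn.getTable(TableName.valueOf("sensor_readings"))) {

            // Every write is addressed by row key.
            Put put = new Put(Bytes.toBytes("station42#2015-06-01"));
            put.addColumn(Bytes.toBytes("obs"), Bytes.toBytes("temp"),
                          Bytes.toBytes("21.5"));
            table.put(put);

            // Random reads likewise go through the row key.
            Result row = table.get(new Get(Bytes.toBytes("station42#2015-06-01")));
            System.out.println(Bytes.toString(
                row.getValue(Bytes.toBytes("obs"), Bytes.toBytes("temp"))));
        }
    }
}
```

Composite row keys such as "station42#2015-06-01" are a common design choice: they preserve fast random access by key while allowing efficient range scans over related rows.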

3.1.2 Data Processing

Since the 1970s, the study of parallel processing technologies has attracted more and more attention because of the "I/O bottleneck" caused by the bandwidth gap between processors and memory. As a result, parallel DBMSs that process huge data sets on clusters of computers have been widely researched [7]. The MapReduce platform has achieved success because it provides a flexible infrastructure, storing data distributively on a scalable system in a simple and flexible style.

"MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster" [8]. Users implement their own map function, which processes a key/value pair, and reduce function, which merges all intermediate values associated with the same key. The programmer does not need to care about the details of partitioning the input data, the execution machines, machine failures, or inter-machine communication; the run-time system handles all of these details. The functional model is inspired by the map and reduce primitives present in LISP. The MapReduce abstraction expresses simple computations while hiding the messy details of parallelization, fault tolerance, data distribution, and load balancing in a library.
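For concreteness, the canonical word-count job written against Hadoop's Java MapReduce API is sketched below; only the map and reduce bodies are user code, and the input/output paths (taken from the command line) are placeholders:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map: emit (word, 1) for every token in the input line.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum all counts that arrive for the same word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a jar, the job is launched with "hadoop jar wordcount.jar WordCount <input> <output>"; the framework handles splitting, scheduling, shuffling, and fault tolerance as described above.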

The MapReduce execution proceeds as follows: (1) the input files are split into M pieces, usually 16-64 MB each; (2) the master assigns work to workers; (3) each map worker reads the contents of its input split; (4) buffered intermediate pairs are written to local disk; (5) reduce workers read the buffered data from the map workers' local disks, sort all intermediate data, and group values by key; (6) each reduce worker passes a key and its values to the reduce function; (7) when all tasks complete, the master wakes up the user program.

The performance of MapReduce is impressive: for example, it took only about 100 seconds to grep 10^10 100-byte records for a rare three-character pattern (occurring in 92,337 records) on a cluster of 1,800 machines.

MapReduce is a powerful and simple framework that has been used across a wide range of domains: large-scale machine learning problems; clustering problems for Google News and Froogle products; extraction of data used to produce reports of popular queries; extraction of properties of web pages for new experiments and products; and large-scale graph computations [8].


3.1.3 Data Access

Hive was initially created by Facebook for its own infrastructure processing; later Facebook open-sourced it and donated it to the Apache Software Foundation. Hive supports custom map-reduce scripts plugged into queries. The programmer writes HQL (Hive Query Language), a SQL-like declarative language, to display results on the console. Data in Hive is organized into tables, partitions, and buckets. Tables are analogous to tables in a relational database, and each table has a corresponding HDFS directory; the data in a table is serialized and stored in files within that directory. Hive provides built-in serialization formats that compress data and support lazy de-serialization. Each table can have one or more partitions, which determine the distribution of data within sub-directories of the table directory. Data in each partition may further be divided into buckets based on the hash of a column in the table; each bucket is stored as a file in the partition directory [9].
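To make the table/partition/bucket concepts concrete, the following sketch (the table schema and HiveServer2 URL are hypothetical) uses Hive's JDBC driver to create a partitioned, bucketed table and run an HQL query against it:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveDemo {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC endpoint; host and port are deployment-specific.
        Connection con = DriverManager.getConnection(
            "jdbc:hive2://localhost:10000/default", "hive", "");
        try (Statement stmt = con.createStatement()) {
            // Partitioned by year, bucketed by station id (hypothetical schema).
            stmt.execute(
                "CREATE TABLE IF NOT EXISTS readings (station STRING, temp DOUBLE) "
              + "PARTITIONED BY (year INT) "
              + "CLUSTERED BY (station) INTO 8 BUCKETS "
              + "STORED AS ORC");

            // HQL is compiled into a DAG of MapReduce jobs behind the scenes.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT station, AVG(temp) FROM readings "
                  + "WHERE year = 2015 GROUP BY station")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
                }
            }
        }
        con.close();
    }
}
```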

The Hive architecture includes the following components: external interfaces, comprising both user interfaces, such as the command line and a web UI, and application programming interfaces (APIs), such as JDBC and ODBC; the Hive Thrift server, which exposes a very simple client API to execute HQL statements; the metastore, which is the system catalog; and the Driver, which manages the life cycle of an HQL statement during compilation, optimization, and execution. The compiler is invoked by the driver upon receiving an HQL statement and translates it into a plan consisting of a DAG of MapReduce jobs. The driver then submits the individual MapReduce jobs from the DAG to the execution engine in topological order [9].

We discuss machine learning toolkits such as Mahout, MLlib, H2O, and SAMOA separately in Section 3.2.

3.1.4 Data Management

According to the ZooKeeper project wiki [10], "ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services." ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical namespace of data registers (znodes). ZooKeeper is analogous to a file system, but unlike normal file systems it provides ordered access to the znodes. The performance characteristics of ZooKeeper allow it to be used in large distributed systems. What differentiates ZooKeeper from standard file systems is that every znode can have data associated with it (every "file" can also be a "directory" and vice versa), although a znode can hold only a limited amount of data. Coordination data, status information, configuration, location information, and the like are stored in znodes. The service itself is replicated over a set of machines that maintain an in-memory image of the data tree along with transaction logs. One disadvantage is that the size of the database ZooKeeper can manage is limited by memory. Each client connects to a single ZooKeeper server [10].
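A minimal sketch with the ZooKeeper Java client (the connect string and znode paths are hypothetical) that stores a small piece of configuration in a znode and reads it back:

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigDemo {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // The client connects to a single server of the ensemble.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000,
            event -> connected.countDown());
        connected.await();

        byte[] config = "batch.size=64".getBytes(StandardCharsets.UTF_8);
        String path = "/demo/config";  // hypothetical znode

        // znodes hold small payloads; persistent nodes outlive the session.
        if (zk.exists("/demo", false) == null) {
            zk.create("/demo", new byte[0],
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        if (zk.exists(path, false) == null) {
            zk.create(path, config,
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        System.out.println(new String(
            zk.getData(path, false, null), StandardCharsets.UTF_8));
        zk.close();
    }
}
```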

3.1.5 Privacy Data Protection

Nowadays, private information must routinely be provided in social life, to hospitals, government agencies, travel services, and more. This information is then analyzed to support further applications, and it may contain privately sensitive details. Because such information can be retrieved by data mining techniques, privacy-preserving data mining has become a new line of investigation in data mining and statistical databases [11]. As Hadoop leads the proliferation of massive data analysis, numerous studies have adapted mining techniques to the MapReduce framework in Hadoop. Randomization and distributed privacy-preserving techniques are two classes of privacy-preserving data mining: randomization generally uses data perturbation and noise addition, while the distributed privacy-preserving approach uses secure multi-party computation and encryption to share mining results without exposing sensitive information [11].

Kangsoo Jung and his team published "Hiding a Needle in a Haystack: Privacy Preserving Apriori Algorithm in MapReduce Framework" in 2014 [12], introducing a new privacy-preserving data mining (PPDM) method that overcomes some existing drawbacks. Their privacy-preserving algorithm is based on "hiding a needle in a haystack": the idea that a rare class of data is hard to find within a large data set. In the proposed method, dummy items are added as noise to the original transaction data, and a special code is used to distinguish the dummy items from the original ones. The method degrades performance only slightly while still satisfying the privacy requirement.
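As a toy sketch of the noise-addition idea only (the dummy-selection strategy and the marker code below are invented for illustration; in the authors' actual scheme the code identifying dummy items is known only to the data owner and is not visible in the data), one might pad each transaction with dummy items before mining and strip them afterwards:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class NeedleInHaystackToy {
    // Invented marker standing in for the secret code of the real scheme.
    private static final String DUMMY_TAG = "~";

    // Pad a transaction with k dummy items drawn from a decoy catalog.
    static List<String> addNoise(List<String> txn, List<String> decoys,
                                 int k, Random rng) {
        List<String> noisy = new ArrayList<>(txn);
        for (int i = 0; i < k; i++) {
            noisy.add(DUMMY_TAG + decoys.get(rng.nextInt(decoys.size())));
        }
        return noisy;
    }

    // The data owner, who knows the code, strips dummies from mining results.
    static boolean isDummy(String item) {
        return item.startsWith(DUMMY_TAG);
    }

    public static void main(String[] args) {
        List<String> txn = List.of("bread", "milk");
        List<String> decoys = List.of("soap", "pens", "tea");
        // e.g. [bread, milk, ~tea, ~soap]: real items hidden among noise
        System.out.println(addNoise(txn, decoys, 2, new Random(7)));
    }
}
```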

3.2 Machine Learning Toolkits

A variety of machine learning toolkits have been created to facilitate the learning process, but many researchers and practitioners do not use them for various reasons, most often because the toolkits lack needed features or are difficult to integrate into an existing environment. Landset et al. (2015) [13] summarized four of the most comprehensive machine learning packages that run on Hadoop. A direct comparison is shown in the following table:


Table 1. Overview of machine learning toolkits that run on Hadoop [13]


Mahout

Mahout is one of the better-known tools for machine learning. It has a wide selection of robust algorithms, including classification, clustering, and collaborative filtering. In May 2017, Mahout was updated to version 0.13.0 [14]. It integrates Samsara, which includes linear algebra, statistical operations, and data structures. The goal of the Mahout-Samsara project is to help users build their own distributed algorithms, rather than simply providing a library.
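As a small illustrative sketch (not Samsara, which exposes a Scala DSL), Mahout's classic Taste API builds a single-machine collaborative-filtering recommender; the file ratings.csv and its userID,itemID,preference layout are hypothetical:

```java
import java.io.File;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class MahoutCfDemo {
    public static void main(String[] args) throws Exception {
        // ratings.csv (hypothetical): one "userID,itemID,preference" per line.
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood =
            new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender =
            new GenericUserBasedRecommender(model, neighborhood, similarity);
        // Top-3 recommendations for user 1.
        for (RecommendedItem item : recommender.recommend(1, 3)) {
            System.out.println(item.getItemID() + " " + item.getValue());
        }
    }
}
```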

MLlib

MLlib covers the same range of learning categories as Mahout and also adds regression models, which Mahout lacks. It also has algorithms for topic modeling and frequent pattern mining. Additional tools include dimensionality reduction, feature extraction and transformation, optimization, and basic statistics. MLlib is built on Spark's iterative batch and streaming approaches, as well as its use of in-memory computation, and is thus much faster than tools based on Mahout. The drawback is its dependence on Spark, which may present a problem for those who perform machine learning on multiple platforms [13][15].
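As a minimal sketch of MLlib via Spark's Java API (the HDFS path and LIBSVM-format training file are hypothetical), fitting a logistic-regression model in memory:

```java
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.classification.LogisticRegressionModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MllibLrDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("MllibLrDemo").getOrCreate();

        // Hypothetical LIBSVM-format training file on HDFS.
        Dataset<Row> training = spark.read().format("libsvm")
            .load("hdfs:///user/demo/train.libsvm");

        // Iterative, in-memory fitting: the source of MLlib's speed edge.
        LogisticRegression lr = new LogisticRegression()
            .setMaxIter(10)
            .setRegParam(0.01);
        LogisticRegressionModel model = lr.fit(training);

        System.out.println("Coefficients: " + model.coefficients());
        spark.stop();
    }
}
```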

H2O

H2O is a mature product compared with the other projects. It has an enterprise edition as well as open-source support without the purchase of a license. More importantly, it provides a graphical user interface (GUI) and numerous tools for deep neural networks. Additionally, it provides deep learning algorithms that are also targeted at research purposes [13][16].

SAMOA

SAMOA, a platform for machine learning from streaming data, was originally developed at Yahoo! Labs in Barcelona in 2013 and has been part of the Apache Incubator since late 2014. It provides a collection of distributed streaming algorithms for the most common data mining and machine learning tasks, such as classification, clustering, and regression, as well as programming abstractions for developing new algorithms that run on top of distributed stream processing engines (DSPEs). It features a pluggable architecture that allows it to run on several DSPEs, such as Apache Storm, Apache S4, Apache Samza, and Apache Flink. SAMOA is similar to Mahout in spirit, but specifically designed for stream mining [13].


4. Architecture of proposed system

The proposed architecture is illustrated in the following figure, layered from system software at the bottom up to domain-knowledge applications at the top.

At the system software level, Linux OS or one of its variants is proposed. Linux provides system services, augmented with parallel file systems and batch schedulers for parallel job management. On top of this layer, Hadoop includes a distributed file system (HDFS) for managing large data. On top of the Hadoop storage system, tools such as Pig and Hive provide a high-level programming model for data access. Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store [5]. At the top level are machine learning toolkits for analytic purposes. Individuals or organizations can also develop stand-alone applications tuned for specific purposes with domain knowledge.

5. Proposed Approaches

This proposal attempts to provide a framework that facilitates the fast development of research-oriented machine learning toolkits in this big data era, without targeting any specific scientific research domain. The required features of machine learning toolkits vary greatly among topics, so a better approach to testing the performance of the proposed framework is to conduct case studies within a given domain.

Although the use cases vary, a few factors need to be considered when judging the performance of the framework:

Scalability: this regards both the size and the complexity of the data; the type of data to be processed should be taken into account.

Speed: considered as throughput rate. Factors affecting speed include the processing platform. Speed can be a crucial concern if the models must be updated often.

Coverage: the range of machine learning algorithms contained in the toolkit.

Usability: a user-friendly interface, low cost of maintenance, good documentation, and a wide choice of programming languages.

Extensibility: how easily the system copes with changes. Tool development moves fast in industry; can the system adopt new changes for better performance?

6. Future work plan

High-end data analytics (big data) and high-end computing (exascale) are both essential elements of integrated computing research and development. Research and development of next-generation algorithms, software, and applications is as crucial as investment in devices and hardware. A single stand-alone application tuned for a specific scientific research task is not sufficient. Therefore, fostering the development of sub-ecosystems of research-oriented computing and analysis tools/products targeted at each knowledge domain must become a trend.


References:

[1] Apache Hadoop. http://hadoop.apache.org/

[2] Ghemawat, S., Gobioff, H., Leung, S.T.: The Google file system. In: Proceedings of the 19th ACM Symposium on Operating Systems Principles (2003)

[3] Dean, J., Ghemawat, S.: MapReduce: Simplified data processing on large clusters. In: Proceedings of OSDI'04: Sixth Symposium on Operating System Design and Implementation (2004)

[4] Monteith, J.Y., McGregor, J.D., Ingram, J.E.: Hadoop and its evolving ecosystem. In: Proceedings of IWSECO (2013)

[5] Reed, D., Dongarra, J.: Exascale computing and big data. Communications of the ACM. DOI: 10.1145/2699414

[6] Shvachko, K., Kuang, H., Radia, S., Chansler, R.: The Hadoop Distributed File System. In: Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST'10), pp. 1-10. IEEE Computer Society, Washington, DC, USA (2010)

[7] Lu, J., Feng, J.: In: International Conference on Cyberspace Technology (CCT 2014), Beijing, pp. 1-4 (2014). DOI: 10.1049/cp.2014.1315

[8] Dean, J., Ghemawat, S.: MapReduce: Simplified data processing on large clusters. In: OSDI (2004)

[9] Thusoo, A., Sen Sarma, J., Jain, N., Shao, Z., Chakka, P., Anthony, S., Liu, H., Wyckoff, P., Murthy, R.: Hive: A warehousing solution over a map-reduce framework. In: VLDB (2009)

[10] Apache ZooKeeper wiki. https://cwiki.apache.org/confluence/display/ZOOKEEPER/Index

[11] Gkoulalas-Divanis, A., Verykios, V.S.: Association Rule Hiding for Data Mining. Springer Science+Business Media, LLC (2010). DOI: 10.1007/978-1-4419-6569-1

[12] Jung, K., Park, S., Park, S.: Hiding a needle in a haystack: Privacy preserving Apriori algorithm in MapReduce framework. In: Proceedings of PSBD '14, the First International Workshop on Privacy and Security of Big Data, pp. 11-17 (2014)

[13] Landset, S., Khoshgoftaar, T.M., Richter, A.N., Hasanin, T.: A survey of open source tools for machine learning with big data in the Hadoop ecosystem. Journal of Big Data 2:24 (2015)

[14] Apache Mahout. http://mahout.apache.org/

[15] Apache Spark MLlib. http://spark.apache.org/mllib/

[16] H2O documentation. http://docs.h2o.ai/h2o/latest-stable/h2o-docs/index.html