Big Data Architecture, Platform, Application and Trend

2016 Joint International Conference on Service Science, Management and Engineering (SSME 2016) and International Conference on Information Science and Technology (IST 2016)

ISBN: 978-1-60595-379-3

Wen-Ying ZENG1, a,*

1Computer Engineering Technical College, Guangdong Institute of Science and Technology,

Guangdong Polytechnic of Science and Technology, Zhuhai, P. R. China awyzeng@126.com

*Corresponding author

Keywords: Big Data, Review, Architecture, Platform, Application, Trend.

Abstract. Big data has become the hottest topic with the cloud computing and massive data

generation. Based on Hadoop, MapReduce and their ecosystem, there are number of improved and

optimized big data dealing and analysis platforms. Extensive applications in the Internet of Things,

social network, transport, medical, education, and etc. have attracted huge interests and attentions of

researchers and industries. To form a hostile profile view, it is necessary to explore big data

architecture, platform, application and trend. Abstract architecture and workflow graph of big data is

presented, some hot platforms are reviewed and illustrated, and big data application and trend are

proposed. The main contribution is to provide detailed and systematic survey and discussion to form a

whole technology framework and overview of big data.

Introduction

Big data is generally thought as with the characteristics of 3Vs (volume, velocity, variety) or

expanded 5Vs (volume, velocity, variety, veracity, value). It generally means the petabytes or

terabytes level data volume. The huge data generates continuously and exceed the ordinary processing

ability of general computers or networks, and the traditional analysis algorithms can’t complete big

data in accessible limited time of humans.

Big data contains implied semantic and domain knowledge that need to combine with domain

experts to perform analysis. Big data are generated from the society life and real world, and the

implied knowledge extracted from it can reflect and discover the future development trends, which

help people make timely decision and improve the operations of business.

The patent value and interest propel the development of big data. Big data is in dynamical

evolution, and a number of creativity on architecture, algorithms, platform, and application is

bombing in growth. Therefore, the whole prospect of big data technology should be summarized to

make more progress and take more advantages of it.

The paper is organized as follows. Section 2 introduces related works. Section 3 and 4 describes

property, classification, architecture of big data; Section 5 lists main platforms of big data; Section 6

illustrates usual application of big data. In section 7, the trend of big data technology is described.

Finally, the conclusion is given.

Related Work

Big data refers to datasets whose size is beyond the ability of typical database software tools to

capture, store, manage, and analyze [1]. Massive data from social media, mobile application, the

Internet of things, and more has result to the hot research in big data. In fact, big data has widely

sources, including telecommunication, web sites, sales, marketing, RFID enabled logistics, utility of

water and energy consumption, finance and insurance, health and prescription, environment, transport,

biological, archaeological, geological exploration, nuclear physics, molecular biology, astronomy,

and etc.[2]. Big data implies technology of platform, tools, techniques to capture, process, ingest, and

visualize large datasets at a reasonable time and cost. Big data also means big data science, framework

and infrastructure to implement big data analysis and knowledge discovery [3].

Because big data has various data sources, e.g. file systems and database systems and other, it is

expected to access by unified interface. The authors in [4] proposed a unified Data Interface

All-iN-A-place (DIANA). A DIANA-based cloud storage system is constructed for versatile, long

distance and large volume big data accessing operations by encapsulating multi-stream/multi-path

engine at the socket level and a new data communication protocol (CloudJet).

Big data storage and processing are considered as one of the main applications for cloud computing

systems. In order to integrate big data processing with cloud M2M systems, the paper [5] proposed a

Machine to Machine (M2M) communications and enabled novel tele-monitoring architectures built

on Remote Telemetry Units and Exhaled Cloud View for E-Health application of big data and IoT

systems.

Exploring and visualizing very large datasets with scalability is a major research challenge. The

system [6] developed by Semantic Web community in the context of the Web of Linked Data is a

solution.

Because of the new characteristics of big data, new platforms and frameworks are required for

management and analysis. Cloud computing can deliver on-demand computing and storage resource

over the Internet on a pay-for-use mode. Cloud computing can also accommodate a huge volume of

data and methodology to deal with big data. The integration of cloud computing and big data is a trend.

Recent research focuses on resource allocation, workflow, security, energy efficiency, and etc.

Big Data Property and Classification

Big Data Property

Big data refers to datasets whose size and complexity is beyond the ability of conventional software

tools in accessible time and cost. Big data is opposite to small data. The usual size of big data is big, as

compared in Fig.1.

Figure 1. The Realms of Data Sizes [7].

Big data definition is generally thought as including three V characteristics, and can be gradually

expanded to four, five, or six properties from different views, as shown in Fig.2.

Figure 2. From 3Vs, 4Vs, 5Vs to 6Vs Big Data Definition [8].

Big Data Classification

According to the characteristics and process procedure of big data, big data can be classified by

multiple criteria: data source, data type, content format, data stores, analysis, and processing

framework [9]. The classification of big data is as shown in Fig.3 [9]. Based on it, we can expand the

big data knowledge and visualization dimension after process framework.

Figure 3. Big Data Classification.

Big Data Architecture

Big data system is comprised of big data ingestion, analysis, visualization, deployment and

distribution modules [2]. Big data architecture of application system contains design, development,

maintenance, decommissioning and integration. Designing methods can be problem analysis,

requirements definition, and business data structure, and etc. [2]. Data view is NoSQL, process view

can be process-oriented, business model, task-oriented, and function modules.

Based on the big data application in social science [10], we summarize an abstract architecture of

big data, which comprises data input, data cleaning, data storing, data mining and analysis, data

visualization and knowledge output. This can be shown in Fig.4. The abstract architecture represents

that the large dataset processing flow is data input, data cleaning, data storing, data mining and

analysis, data visualization and knowledge output.

Figure 4. Abstract Architecture of Big Data.

According to the view of users centered, the architecture of big data can be simplified to a

hierarchical structure, which includes distributed storage, parallel computing, and collaborative

pre-process, concurrent inquiring. We present the structure as Fig.5.

From the viewpoint of application of big data, we propose the architecture of big data application

based on [11], which contains scalable DBMS and DMS, application server cluster, load balance and

management, application clients. It is shown as Fig.6. Every layer may be distributed on concurrent.

Figure 5. The Hierarchical Structure of Big

Data System.

Figure 6. The Application Architecture of

Big Data

Big data constitutes core technologies, components and mechanisms in data ingesting, storage,

processing, analysis, visualization, and delivering results to target applications. Big data may involve

the following technologies [3]: (1) Cloud based solutions: Cloud computing can help to provide lower

storage costs and stronger computing capability by renting, such as Amazon Web Services (AWS). (2)

Virtual file systems: Open source and vendor specific products can implement service based approach

and distributed storage. (3) New generation products: Hadoop and NoSQL database can compose and

analyze data from heterogeneous data source from sensor network, social media, Internet and etc. (4)

Sensors: Sensor devices including cameras, telescopes, medical machines, chemical, physics and

biological sensors can continually generate data, which is the one of the main data source needed to be

processed and mined. (5) Cluster systems: Cluster computer systems with hundreds and thousands of

nodes and benefits of scalability, efficiency and reliability can offer efficient storage and computing

power to organize, analyze, and output knowledge. (6) Data analysis: Efficient algorithms and models

of computing and statistical and optimization can use large collections of data to analyze fast. (7) Data

storage technologies: NAS (Network attached storage) and SAN (Storage area network) can be used

in big data.

As shown in Fig.7, NIST Big Data Reference Architecture [12] has been proposed in September

2015, which considered security and privacy management in the whole architecture, and added

message and communication, resource management module.

Figure 7. NIST Big Data Reference Architecture [12].

Big Data Platform

There are multitudes of tools and platforms which bring in difficult for users to select the right tools

to deal with big data. It is necessary to illustrate various big data tools and platforms. There are many

service providers of big data platform, such as IBM, Oracle, Microsoft, HP, SAS, EMC, Amazon,

Google. The core big data tools and platforms available in the industry can be classified into analytics

& reporting tools, data integration & governance framework. They include broad scopes which may

be overlaps: Hadoop Ecosystem, NoSQL Database, In Memory Database, Data Warehousing

Application, Streaming Event Processing Frameworks, Search Engines, and Berkeley Data Analytics

Stack (BDAS) [13]. Hadoop and its improved version i.e. Spark are foundation and popular. The

integration services vendors of big data platforms are Hortonworks, Cloudera, MapR, Palantir,

Pivotal, Splunk, DataStax, Datameer, and etc [14]. Some of the typical platforms are illustrated as

follows.

Hadoop Platform Structure

The Critical components of the Hadoop software stack as shown in Fig.8.

Data Exchange

LogCollector

Zookeeper

Coordiantion

Workflow

Columnar Database

Scripting

Mahout

Machine Learning

R Connectors

Analytics

Data W

arehousing

Figure 8. Critical Components of the Hadoop Software Stack [15].

Spark Stack

Spark is an open source project and a general-purpose framework for cluster computing [16]. The

Spark stack is shown in Fig.9, among it, Spark Core contain functionality of Spark, including task

scheduling, memory management, fault recovery, interacting with storage systems, and more. Spark

Core provides APIs for building and manipulating RDDs. Spark SQL is a package working with

structured data, allowing querying data via SQL and to intermix SQL queries with programmatic data

manipulations supported by RDDs in Python, Java, and Scala. Spark Streaming enables processing of

live streams of data, including log files by web servers, queues of messages of status of web service

and etc. MLib contains common machine learning (ML) library of algorithms, such as classification,

regression, clustering, collaborative filtering, model evaluation and data import. GraphX is a library

for manipulating graphs such as social network friends graph and performing graph-parallel

computations. Standalone Scheduler is Spark’s simple cluster manager, and Spark can support and

run over other cluster managers such as YARN and Meos.

Figure 9. The Spark Stack [16].

Cloudera’s Open Source Platform

Hadoop is an ecosystem of open source components to perform enterprises store, process, and analyze

data. The Hadoop ecosystem is complex and constantly changing. Hadoop enables multiple types of

analytic workloads to run on the same data at the same time. Cloudera Manager is the easiest way to

administer Hadoop in any environment, with intelligent configuration defaults, customized

monitoring, and robust troubleshooting. CDH, Cloudera's open source platform, is the most popular

distribution of Hadoop and related projects [17]. CDH architecture is shown in Fig.10.

Figure 10. CDH Architecture [17].

Hortonworks Data Platform

Hortonworks Data Platform (HDP) integrates a series of big data process, analytics management

components and tools. HDP is enterprise-ready open source Apache Hadoop distribution based on a

centralized architecture (YARN) [18]. HDP architecture is shown as Fig.11.

Figure 11. Hortonworks Data Platform Architecture [18].

Application of Big Data

Big data has wide range of applications in business, social network, education, networking,

healthcare, mobile systems, web logs, traffic control systems, weather forecasting, fraud control,

media and entertainment, disaster prevention, etc. Big Data software and services support an

innovative ecosystem and generate value by enabling completely new generation of solutions that

were not possible before [3].

In our opinion, big data technology should be applied in the emergent and difficulty areas which

can’t be resolved by usual way. They are global environment protection, natural disaster prevention

and prediction, humankind survival and sustainability, and etc. The whole worlds collaborate and

share the relative big data of specified areas, and perform continuous online analysis and possible

improvement may be needed.

The topical, multimodal, and longitudinal social media datasets from the integration of various

scalable open source technologies is presented [10] by using scalable open source technologies.

The workflow of big data application of environment can be illustrated as multiple phases [19]. To

add the first step as goal definition and improvement, the general workflow model of big data import,

process, analytics and decision loop flow graph is as shown in Fig.12.

Figure 12. Construction Data Analytics Lifecycle Stages.

Big Data Trend

We have explored the documents by keyword in the science direct web sites from the year of 1998

to 2015 in ScienceDirect Library, and have found that the number of big data related increased at

exponential rate, as shown in Fig.13. It shows that big data application is the most frequent word, then

the second is big data trend, and big data architecture, big data platform is the third. Surely there are

other related vocabularies not shown here share similar phenomena such as big data analysis, mining,

intelligence, machine learning, visualization, and etc.

Figure 13. Document Analysis of Big Data at ScienceDirect Library.

Big data may face with data integration, intelligent management, security and privacy protection,

prediction accuracy, responding performance, and etc.

From big data collection, process, knowledge discovery and distribution, big data will meet with

many challenges especially in privacy and security, flexibility, complexity. Meanwhile, big data

future trend will be more security, accurate, reliable, automatic and convenient.

Now that big data is accumulated and analyzed, knowledge may attain to be increased day by day.

How to use the knowledge and build relation may be a problem. On the other hand, new cases may

occur and the accuracy and rightness of the knowledge extracted from big data will possibly change.

Generally speaking, the future trend of big data will make more and more progress in technology,

economic, coverage of the target industries and areas, and efficiency avenues to a clean and wisdom in

many things. Challenges lie in data volume, redundancy of replications, performance, the complexity

of MapReduce framework, limited SQL support, lack of essential skills in data mining and machine

learning [20].

Conclusion

Big data involves multiple technologies, including cloud computing, parallel computing, mobile

computing, database, artificial intelligence, data mining, domain expertise, and etc. From the general

angle, the paper tends to illustrate a hostile comprehension in big data architecture, platform,

application, and future tend.

Acknowledgements

This research is supported by the following fund or projects: the Chinese National Scholarship

Fund; the Higher Vocation Education Teaching Reform Project of Guangdong Province, Project

No.201401091; Guangdong Province Higher Vocational Education Brand Major Construction

Project; Guangdong Province Higher Vocational Education Information Technology Project, Project

No. XXJS-2013-1008; the Guangdong Institute of Science and Technology Teaching Reform

Research Project, Project No. JG201502; Guangdong Province Higher Vocational and Colleges

Cloud Computing and Big Data Education Science Research Project, Project No. GDYJSKT16-02.

The author is highly grateful for the above supports.

References

[1] Dong, Fang, Malloy Alisha, “Recent research advances in cloud computing and big data,” in

Concurrency and Computation: Practice and Experience, 2015, Vol. 27(18), pp. 5574-5576.

[2] Atif Mohammad, Hamid Mcheick, Emanuel Grant. Big data architecture evolution: 2014 and

beyond. September 2014 DIVANet '14: Proceedings of the fourth ACM international symposium on

Development and analysis of intelligent vehicular networks and applications.

[3] Nidhi Grover.‘Big Data’- Architecture, Issues, Opportunities and Challenges. International

Journal of Computer & Electronics Research, 01 March 2014, Vol. 3(1), pp. 26-31.

[4] WangFrank Z., Theo Dimitrakos, Na Helian, and et al. DIANA: Data Interface All-iN-A-place

for Big Data.2014 IEEE 13th International Conference on Trust, Security and Privacy in Computing

and Communications, 2014(9):665-672. IEEE Digital Library, 2014-09-24.

[5] George Suciu & Victor Suciu & Alexandru Martian1 & Razvan Craciunescu & Alexandru Vulpe

& Ioana Marcu & Simona Halunga & Octavian Fratu. Big Data, Internet of Things and Cloud

Convergence – An Architecture for Secure E-Health Applications

[6] Nikos Bikakis, Timos Sellis. Exploration and Visualization in the Web of Big Linked Data: A

Survey of the State of the Art. LWDM ‘16 March 15, 2016, Bordeaux, France: 1-8.

[7] Maksim Tsvetovat, Alexander Kouznetsov. Social Network Analysis for Startups. O’Reilly

publish Inc., 2012.

[8] Caesar Wu, Rajkumar Buyya, Kotagiri Ramamohanarao. Big Data Analytics = Machine

Learning + Cloud Computing. 27 pages, 23 figures. a Book Chapter in "Big Data: Principles and

Paradigms, R. Buyya, R. Calheiros, and A. Dastjerdi (eds), Morgan Kaufmann, Burlington,

Massachusetts, USA, 2016."

[9] S. K. Divakar Mysore and S. Jain, Big Data Architecture and Patterns, Part 1: Introduction to Big

Data Classification and Architecture, IBM Big Data and Analytics, Technical Library, 2013

[10] Eugene Ch'ng. The Value of Using Big Data Technologies in Computational Social Science.

August 2014 BigDataScience '14: Proceedings of the 2014 International Conference on Big Data

Science and Computing. J Med Syst (2015) 39: 141. DOI 10.1007/s10916-015-0327-y

[11] Divyakant Agrawal Sudipto Das Amr El Abbadi. Big data and cloud computing: current state and

future opportunities. EDBT 2011, March 22–24, 2011, Uppsala, Sweden.

[12] NIST Big Data Public Working Group, Security and Privacy Subgroup. NIST Big Data

Interoperability Framework: Volume 4, Security and Privacy.

http://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.1500-4.pdf.

[13] Sourav Mazumder, S. Yu, S. Guo (eds). Chapter 2 Big Data Tools and Platforms. Big Data

Concepts, Theories, and Applications, 2013:pp29-30. DOI 10.1007/978-3-319-27763-9_2

[14] http://www.xueliedu.com/a/xinwenzixun/2015/0114/262117.html

[15] T. White, Hadoop. The definitive guide, 3rd ed., O’Reilly, Sebastopol, Canada, 2012.

[16] Holden Karau, Andy Konwinski, Patrick Wendell & Matei Zaharia. Learning Spark. Oreily

Media Inc., 2015.

[17] http://www.cloudera.com/products.html

[18] http://hortonworks.com/products/hdp/

[19] Muhammad Bilal, Lukumon O. Oyedele, Olugbenga O. Akinade, Saheed O. Ajayi, Hafiz A.

Alaka, Hakeem A. Owolabi, Junaid Qadir, Maruf Pasha, Sururah A. Bello. Big data architecture for

construction waste analytics (CWA): A conceptual framework. Journal of Building Engineering, In

Press, Accepted Manuscript, Available online 8 March 2016:1-38.

[20] Nawsher Khan, Ibrar Yaqoob, Ibrahim Abaker Targio Hashem, and et al. Big data: survey,

technologies, opportunities, and challenges. Hindawi Publishing Corporation, e Scientific World

Journal, Volume 2014: 1-18. http://dx.doi.org/10.1155/2014/712826.

Big Data Architecture, Platform, Application and Trend

Documents

Transcript of Big Data Architecture, Platform, Application and Trend

Open Platform Trend 2008

PLATFORM Magazine Issue 10: Architecture

Cisco Expo 2010 · Transport Messaging Platform Price/ Performance Vertically Integrated Architecture Media Experience Platform Platform Sustainability Distributed Virtualized Architecture

OpenShift Container Platform 3.7 Architecture · OpenShift Container Platform 3.7 Architecture OpenShift Container Platform 3.7 Architecture Information Last Updated: 2018-11-23

Trend Analysis BIHAR - GHG Platform India

Insurance Digital Platform Trend & Application Of InsurTech

EXCITEMENT open platform: Architecture and Interfaceshltfbk.github.io/Excitement-Open-Platform/specification/... · 2017. 4. 15. · EXCITEMENT open platform: Architecture and Interfaces

Platform System Architecture 2nd Release - Deserve Project...Platform System Architecture – 2nd Release Deliverable n. D25.2 - Platform System Architecture (Second Release) Sub Project

VCE Shared Management Platform Architecture … Management Platform Architecture Overview Accessing VCE documentation 6 ... SMP is equipped with management ... Shared Management Platform

VCE Shared Management Platform Architecture · PDF fileInter-Vblock system networking ... Shared Management Platform Architecture Overview Accessing VCE ... Shared Management Platform

trend Micro DEEP SECURITY 9 · 2014-03-27 · PLATFORM ARCHITECTURE deep security Virtual Appliance. Transparently enforces security policies on VMware vSphere virtual machines for

Digital Platform Reference Architecture

OpenShift Container Platform 4.6 Architecture

Trusted Platform Module Library Part 1: Architecture · Trusted Platform Module Library Part 1: Architecture Page ii Published Family “2.0” ... Part 1: Architecture Trusted Platform

MedINRIA 2.0 - Platform architecture overview

Trend 3 Platform Economy: Technology-driven …...Trend 3 Platform Economy: Technology-driven business model innovation from the outside in Industry leaders are unleashing technology’s

Open In-Vehicle Platform architecture

OpenShift Container Platform 3.9 Architecture · OpenShift Container Platform 3.9 Architecture OpenShift Container Platform 3.9 Architecture Information Last Updated: 2019-02-27

L19_22 Embedded Platform Architecture

Trend Analysis SIKKIM - GHG Platform India