2016 Joint International Conference on Service Science, Management and Engineering (SSME 2016) and International Conference on Information Science and Technology (IST 2016)
ISBN: 978-1-60595-379-3
Big Data Architecture, Platform, Application and Trend
Wen-Ying ZENG1, a,*
1Computer Engineering Technical College, Guangdong Institute of Science and Technology,
Guangdong Polytechnic of Science and Technology, Zhuhai, P. R. China awyzeng@126.com
*Corresponding author
Keywords: Big Data, Review, Architecture, Platform, Application, Trend.
Abstract. Big data has become one of the hottest topics alongside cloud computing and the massive
generation of data. Building on Hadoop, MapReduce and their ecosystem, a number of improved and
optimized platforms for big data processing and analysis have appeared. Extensive applications in the
Internet of Things, social networks, transport, medicine, education, etc. have attracted great interest
from researchers and industry. To form a holistic view, it is necessary to explore big data
architecture, platforms, applications and trends. An abstract architecture and workflow graph of big
data are presented, some popular platforms are reviewed and illustrated, and big data applications and
trends are discussed. The main contribution is a detailed and systematic survey and discussion that
forms an overall technology framework and overview of big data.
Introduction
Big data is generally characterized by the 3Vs (volume, velocity, variety) or the expanded 5Vs
(volume, velocity, variety, veracity, value). It generally refers to data volumes at the terabyte or
petabyte level. Such data is generated continuously and exceeds the processing ability of ordinary
computers or networks, and traditional analysis algorithms cannot process it within an acceptable
time.
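A rough calculation makes the scale concrete. The sketch below estimates how long a single sequential pass over one petabyte would take; the 100 MB/s per-node read rate and the 1000-node cluster are assumed figures chosen for illustration, not measurements from this paper.

```python
# Back-of-the-envelope scan time; the throughput figure is an assumption.
PETABYTE = 10**15          # bytes (decimal petabyte)
DISK_THROUGHPUT = 100e6    # bytes/second, assumed sequential read rate of one node


def scan_seconds(size_bytes: float, nodes: int = 1) -> float:
    """Time to read size_bytes once when the work is split evenly across nodes."""
    return size_bytes / (DISK_THROUGHPUT * nodes)


single_days = scan_seconds(PETABYTE) / 86_400
cluster_hours = scan_seconds(PETABYTE, nodes=1000) / 3_600

print(f"1 node:     {single_days:.1f} days")    # 115.7 days
print(f"1000 nodes: {cluster_hours:.1f} hours")  # 2.8 hours
```

Under these assumptions a single machine needs months for one pass, while an evenly loaded cluster needs hours, which is why the platforms reviewed later rely on distribution.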
Big data contains implicit semantic and domain knowledge, so its analysis needs to be carried out
together with domain experts. Big data is generated from social life and the real world, and the
knowledge extracted from it can reveal future development trends, which helps people make timely
decisions and improve business operations.
The potential value of and interest in big data propel its development. Big data is evolving
dynamically, and innovations in architecture, algorithms, platforms, and applications are booming.
Therefore, the overall landscape of big data technology should be summarized in order to make
further progress and take better advantage of it.
The paper is organized as follows. Section 2 introduces related work. Sections 3 and 4 describe the
properties, classification, and architecture of big data; Section 5 lists the main big data platforms;
Section 6 illustrates typical applications of big data. In Section 7, the trend of big data technology
is described. Finally, the conclusion is given.
Related Work
Big data refers to datasets whose size is beyond the ability of typical database software tools to
capture, store, manage, and analyze [1]. Massive data from social media, mobile applications, the
Internet of Things, and more has made big data a hot research topic. In fact, big data has a wide
range of sources, including telecommunications, web sites, sales, marketing, RFID-enabled logistics,
water and energy utilities, finance and insurance, health and prescriptions, environment, transport,
biology, archaeology, geological exploration, nuclear physics, molecular biology, astronomy, etc. [2].
Big data implies platforms, tools, and techniques to capture, ingest, process, and visualize large
datasets within a reasonable time and cost. Big data also covers the data science, frameworks and
infrastructure needed to implement big data analysis and knowledge discovery [3].
Because big data comes from various sources, e.g. file systems, database systems and others, it is
desirable to access it through a unified interface. The authors in [4] proposed a unified Data
Interface All-iN-A-place (DIANA). A DIANA-based cloud storage system is constructed for versatile,
long-distance and large-volume big data access operations by encapsulating a multi-stream/multi-path
engine at the socket level and a new data communication protocol (CloudJet).
Big data storage and processing are considered among the main applications of cloud computing
systems. In order to integrate big data processing with cloud M2M systems, the paper [5] proposed
novel tele-monitoring architectures enabled by Machine to Machine (M2M) communications and built
on Remote Telemetry Units and Exalead CloudView for E-Health applications of big data and IoT
systems.
Exploring and visualizing very large datasets in a scalable way is a major research challenge. The
systems surveyed in [6], developed by the Semantic Web community in the context of the Web of
Linked Data, offer solutions.
Because of the new characteristics of big data, new platforms and frameworks are required for
management and analysis. Cloud computing can deliver on-demand computing and storage resources
over the Internet on a pay-per-use basis. Cloud computing can also accommodate huge volumes of
data and the methods needed to deal with big data. The integration of cloud computing and big data is
a trend. Recent research focuses on resource allocation, workflow, security, energy efficiency, etc.
Big Data Property and Classification
Big Data Property
Big data refers to datasets whose size and complexity are beyond the ability of conventional software
tools to handle within an acceptable time and cost. Big data stands in contrast to small data; typical
data sizes are compared in Fig.1.
Figure 1. The Realms of Data Sizes [7].
The definition of big data is generally thought to include three V characteristics, which can be
gradually expanded to four, five, or six properties from different viewpoints, as shown in Fig.2.
Figure 2. From 3Vs, 4Vs, 5Vs to 6Vs Big Data Definition [8].
Big Data Classification
According to the characteristics and processing procedure of big data, big data can be classified by
multiple criteria: data source, data type, content format, data stores, analysis, and processing
framework [9]. The classification is shown in Fig.3 [9]. Based on it, knowledge and visualization
dimensions can be added after the processing framework.
Figure 3. Big Data Classification.
Big Data Architecture
A big data system comprises ingestion, analysis, visualization, deployment and distribution
modules [2]. The architecture life cycle of a big data application system covers design, development,
maintenance, decommissioning and integration. Design methods include problem analysis,
requirements definition, business data structure, etc. [2]. The data view is NoSQL-based, while the
process view can be process-oriented, business-model-based, task-oriented, or organized by function
modules.
Based on the big data application in social science [10], we summarize an abstract architecture of
big data, shown in Fig.4. It represents the processing flow of a large dataset as six stages: data
input, data cleaning, data storing, data mining and analysis, data visualization, and knowledge
output.
Figure 4. Abstract Architecture of Big Data.
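The six stages of the abstract architecture can be sketched as a chain of small functions. This is only a toy illustration: the sample records, the in-memory dictionary standing in for a distributed store, and all function names are hypothetical and are not taken from [10].

```python
from collections import Counter
from typing import Iterable

# Hypothetical raw records standing in for an input stream.
RAW = ["  Flood warning ", "", "flood WARNING", "traffic jam",
       None, "Traffic Jam ", "FLOOD warning"]


def ingest(source: Iterable) -> list:
    """Data input: materialise records from the source."""
    return list(source)


def clean(records: list) -> list:
    """Data cleaning: drop empty records, normalise case and whitespace."""
    return [r.strip().lower() for r in records if r and r.strip()]


def store(records: list, storage: dict) -> dict:
    """Data storing: an in-memory dict stands in for a distributed store."""
    storage["events"] = records
    return storage


def analyse(storage: dict) -> Counter:
    """Data mining and analysis: count how often each event occurs."""
    return Counter(storage["events"])


def visualise(counts: Counter) -> str:
    """Data visualisation: a plain-text bar chart."""
    return "\n".join(f"{k:<14}{'#' * v}" for k, v in counts.most_common())


def knowledge(counts: Counter) -> str:
    """Knowledge output: the most frequent event."""
    return counts.most_common(1)[0][0]


counts = analyse(store(clean(ingest(RAW)), {}))
print(visualise(counts))
print("most frequent:", knowledge(counts))  # most frequent: flood warning
```

Each stage takes the output of the previous one, so a real system could replace any single function (for example, swapping the dictionary for a distributed file system) without changing the others.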
From a user-centered view, the architecture of big data can be simplified to a hierarchical
structure, which includes distributed storage, parallel computing, collaborative pre-processing, and
concurrent querying. We present the structure in Fig.5.
From the viewpoint of big data applications, we propose an application architecture based on [11],
which contains scalable DBMS and DMS, an application server cluster, load balancing and
management, and application clients. It is shown in Fig.6. Every layer may be distributed and run
concurrently.
Figure 5. The Hierarchical Structure of Big Data System.
Figure 6. The Application Architecture of Big Data.
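The layers of the hierarchical structure (distributed storage, parallel computing, concurrent querying) can be illustrated on a single machine. In the sketch below, list slices stand in for storage shards and a thread pool stands in for cluster nodes; the function names and the sample readings are hypothetical, and threads are used only to show the partition, compute, merge pattern rather than to gain real speed.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records: list, n_parts: int) -> list:
    """Distributed storage stand-in: spread records over n_parts shards."""
    return [records[i::n_parts] for i in range(n_parts)]


def local_sum(shard: list) -> tuple:
    """Parallel computing stand-in: each shard yields a partial (sum, count)."""
    return sum(shard), len(shard)


def query_mean(shards: list) -> float:
    """Query layer: run the partial computations concurrently, then merge."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(local_sum, shards))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


readings = list(range(1, 101))      # hypothetical sensor readings 1..100
shards = partition(readings, n_parts=4)
print(query_mean(shards))           # 50.5
```

Only the small partial results cross shard boundaries, which is the property that lets the same pattern scale when shards live on different machines.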
Big data comprises core technologies, components and mechanisms for data ingestion, storage,
processing, analysis, visualization, and delivery of results to target applications. Big data may
involve the following technologies [3]: (1) Cloud-based solutions: Cloud computing, such as Amazon
Web Services (AWS), can provide lower storage costs and stronger computing capability on a rental
basis. (2) Virtual file systems: Open source and vendor-specific products can implement a
service-based approach and distributed storage. (3) New-generation products: Hadoop and NoSQL
databases can combine and analyze data from heterogeneous sources such as sensor networks, social
media, and the Internet. (4) Sensors: Sensor devices including cameras, telescopes, medical machines,
and chemical, physical and biological sensors continually generate data, which is one of the main
data sources to be processed and mined. (5) Cluster systems: Cluster computer systems with hundreds
or thousands of nodes offer scalability, efficiency and reliability, and provide the storage and
computing power to organize and analyze data and output knowledge. (6) Data analysis: Efficient
computational, statistical and optimization algorithms and models can analyze large collections of
data quickly. (7) Data storage technologies: NAS (network-attached storage) and SAN (storage area
network) can be used in big data.
As shown in Fig.7, the NIST Big Data Reference Architecture [12] was proposed in September 2015. It
considers security and privacy management across the whole architecture and adds messaging and
communication and resource management modules.
Figure 7. NIST Big Data Reference Architecture [12].
Big Data Platform
The multitude of tools and platforms makes it difficult for users to select the right tools to deal
with big data, so it is necessary to illustrate the various big data tools and platforms. There are
many providers of big data platforms, such as IBM, Oracle, Microsoft, HP, SAS, EMC, Amazon, and
Google. The core big data tools and platforms available in the industry can be classified into
analytics & reporting tools and data integration & governance frameworks. They cover broad and
possibly overlapping scopes: Hadoop Ecosystem, NoSQL Database, In-Memory Database, Data
Warehousing Application, Streaming Event Processing Frameworks, Search Engines, and Berkeley Data
Analytics Stack (BDAS) [13]. Hadoop and Spark, often regarded as its improved successor, are
foundational and popular. Vendors of integration services for big data platforms include
Hortonworks, Cloudera, MapR, Palantir, Pivotal, Splunk, DataStax, Datameer, etc. [14]. Some typical
platforms are illustrated as follows.
Hadoop Platform Structure
The critical components of the Hadoop software stack are shown in Fig.8.
[Fig.8 component labels: Sqoop (Data Exchange), Flume (Log Collector), ZooKeeper (Coordination), Oozie (Workflow), HBase (Columnar Database), Pig (Scripting), Mahout (Machine Learning), R Connectors (Analytics), Hive (Data Warehousing)]
Figure 8. Critical Components of the Hadoop Software Stack [15].
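The Hadoop ecosystem builds on the MapReduce programming model named in the abstract. The sketch below imitates its three phases (map, shuffle, reduce) for a word count in plain Python. It runs in a single process and does not use Hadoop's API; the function names and input lines are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line: str) -> list:
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs) -> dict:
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list) -> tuple:
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


lines = ["big data platform", "big data trend", "data architecture"]
mapped = chain.from_iterable(map_phase(line) for line in lines)
result = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(result)
# {'big': 2, 'data': 3, 'platform': 1, 'trend': 1, 'architecture': 1}
```

In a real cluster the map and reduce calls would run on different nodes and the shuffle would move data over the network; here the three functions only show how the work is divided.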
Spark Stack
Spark is an open source project and a general-purpose framework for cluster computing [16]. The
Spark stack is shown in Fig.9. Within it, Spark Core contains the basic functionality of Spark,
including task scheduling, memory management, fault recovery, interaction with storage systems, and
more. Spark Core provides APIs for building and manipulating resilient distributed datasets (RDDs).
Spark SQL is a package for working with structured data; it allows querying data via SQL and
intermixing SQL queries with the programmatic data manipulations supported by RDDs in Python, Java,
and Scala. Spark Streaming enables processing of live streams of data, such as log files generated by
web servers or queues of messages containing status updates from a web service. MLlib is a library of
common machine learning (ML) functionality, such as classification, regression, clustering,
collaborative filtering, model evaluation and data import. GraphX is a library for manipulating
graphs, such as a social network's friend graph, and performing graph-parallel computations. The
Standalone Scheduler is Spark's simple built-in cluster manager, and Spark can also run over other
cluster managers such as YARN and Mesos.
Figure 9. The Spark Stack [16].
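The idea behind building and manipulating RDDs can be hinted at with a toy class. In Spark, transformations such as map and filter are recorded lazily and only executed when an action such as collect or reduce is called. The `LazyDataset` class below is a hypothetical single-machine imitation of that behaviour and is not Spark's API; the log lines are made up.

```python
from functools import reduce


class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on demand."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = steps          # recorded chain of (kind, function) pairs

    def map(self, fn):
        """Transformation: returns a new dataset, computes nothing yet."""
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        """Transformation: returns a new dataset, computes nothing yet."""
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def _iterate(self):
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    def collect(self):
        """Action: replay the recorded steps and return all results."""
        return list(self._iterate())

    def reduce(self, fn):
        """Action: replay the recorded steps and fold the results."""
        return reduce(fn, self._iterate())


log_lines = ["200 /index", "500 /pay", "200 /cart", "500 /pay", "404 /x"]
errors = (LazyDataset(log_lines)
          .map(lambda line: line.split())
          .filter(lambda parts: parts[0] == "500"))

print(errors.collect())                                        # [['500', '/pay'], ['500', '/pay']]
print(errors.map(lambda parts: 1).reduce(lambda a, b: a + b))  # 2
```

Because each dataset keeps the full chain of steps from the original source, a lost result can be rebuilt by replaying that chain, which is a simplified picture of how recorded lineage supports fault recovery.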
Cloudera’s Open Source Platform
Hadoop is an ecosystem of open source components that enables enterprises to store, process, and
analyze data. The Hadoop ecosystem is complex and constantly changing. Hadoop enables multiple types
of analytic workloads to run on the same data at the same time. Cloudera Manager is presented by
Cloudera as the easiest way to administer Hadoop in any environment, with intelligent configuration
defaults, customized monitoring, and robust troubleshooting. CDH, Cloudera's open source platform, is
described as the most popular distribution of Hadoop and related projects [17]. The CDH architecture
is shown in Fig.10.
Figure 10. CDH Architecture [17].
Hortonworks Data Platform
Hortonworks Data Platform (HDP) integrates a series of big data processing, analytics, and
management components and tools. HDP is an enterprise-ready open source Apache Hadoop distribution
based on a centralized architecture (YARN) [18]. The HDP architecture is shown in Fig.11.
Figure 11. Hortonworks Data Platform Architecture [18].
Application of Big Data
Big data has a wide range of applications in business, social networks, education, networking,
healthcare, mobile systems, web logs, traffic control systems, weather forecasting, fraud control,
media and entertainment, disaster prevention, etc. Big data software and services support an
innovative ecosystem and generate value by enabling a completely new generation of solutions that
were not possible before [3].
In our opinion, big data technology should be applied to urgent and difficult problems that cannot be
solved by conventional means, such as global environmental protection, natural disaster prediction
and prevention, and human survival and sustainability. This may require the whole world to
collaborate, share the relevant big data of the specified areas, and perform continuous online
analysis and improvement.
Topical, multimodal, and longitudinal social media datasets obtained by integrating various scalable
open source technologies are presented in [10].
The workflow of an environment-related big data application can be illustrated as multiple phases
[19]. Adding goal definition and improvement as a first step, the general workflow model of the big
data import, processing, analytics and decision loop is shown in Fig.12.
Figure 12. Construction Data Analytics Lifecycle Stages.
Big Data Trend
We searched documents by keyword in the ScienceDirect library for the years 1998 to 2015 and found
that the number of big data related documents increased at an exponential rate, as shown in Fig.13.
Big data application is the most frequent term, followed by big data trend, with big data
architecture and big data platform third. Other related terms not shown here, such as big data
analysis, mining, intelligence, machine learning, and visualization, likely show a similar pattern.
Figure 13. Document Analysis of Big Data at ScienceDirect Library.
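A claim of exponential growth can be checked by fitting a straight line to the logarithm of the yearly counts. The sketch below shows the procedure on synthetic counts generated from a known exponential. These numbers are made up for illustration and are not the values behind Fig.13; the year range and the function name are likewise assumptions.

```python
import math

# Synthetic counts for illustration only; NOT the data behind Fig.13.
years = list(range(2008, 2016))
counts = [round(10 * math.exp(0.5 * (y - 2008))) for y in years]


def fit_exponential(xs, ys):
    """Least-squares fit of ln(y) = ln(a) + b*x; returns (a, b)."""
    n = len(xs)
    logs = [math.log(y) for y in ys]
    mean_x = sum(xs) / n
    mean_l = sum(logs) / n
    b = (sum((x - mean_x) * (l - mean_l) for x, l in zip(xs, logs))
         / sum((x - mean_x) ** 2 for x in xs))
    a = math.exp(mean_l - b * mean_x)
    return a, b


offsets = [y - years[0] for y in years]   # years since the first observation
a, b = fit_exponential(offsets, counts)
print(f"growth rate b = {b:.2f} per year, "
      f"doubling time = {math.log(2) / b:.1f} years")
```

If the points of ln(count) against year fall close to a straight line, the growth is approximately exponential, and the slope b gives the doubling time as ln(2)/b.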
Big data may face challenges in data integration, intelligent management, security and privacy
protection, prediction accuracy, response performance, etc.
Across collection, processing, knowledge discovery and distribution, big data will meet many
challenges, especially in privacy and security, flexibility, and complexity. Meanwhile, future big
data technology will tend to be more secure, accurate, reliable, automatic and convenient.
As big data is accumulated and analyzed, the extracted knowledge grows day by day. How to use this
knowledge and relate its pieces to one another may become a problem. On the other hand, new cases
may occur, and the accuracy and correctness of the knowledge extracted from big data may change.
Generally speaking, big data will make more and more progress in technology, economics, coverage of
target industries and areas, and efficiency, opening avenues to cleaner and smarter solutions in many
fields. Challenges lie in data volume, redundancy of replications, performance, the complexity of the
MapReduce framework, limited SQL support, and a lack of essential skills in data mining and machine
learning [20].
Conclusion
Big data involves multiple technologies, including cloud computing, parallel computing, mobile
computing, databases, artificial intelligence, data mining, domain expertise, etc. From a general
perspective, this paper aims to give a holistic understanding of big data architecture, platforms,
applications, and future trends.
Acknowledgements
This research is supported by the following funds and projects: the Chinese National Scholarship
Fund; the Higher Vocation Education Teaching Reform Project of Guangdong Province, Project
No.201401091; Guangdong Province Higher Vocational Education Brand Major Construction
Project; Guangdong Province Higher Vocational Education Information Technology Project, Project
No. XXJS-2013-1008; the Guangdong Institute of Science and Technology Teaching Reform
Research Project, Project No. JG201502; Guangdong Province Higher Vocational and Colleges
Cloud Computing and Big Data Education Science Research Project, Project No. GDYJSKT16-02.
The author is highly grateful for the above support.
References
[1] Dong, Fang, Malloy Alisha, “Recent research advances in cloud computing and big data,” in
Concurrency and Computation: Practice and Experience, 2015, Vol. 27(18), pp. 5574-5576.
[2] Atif Mohammad, Hamid Mcheick, Emanuel Grant. Big data architecture evolution: 2014 and
beyond. September 2014 DIVANet '14: Proceedings of the fourth ACM international symposium on
Development and analysis of intelligent vehicular networks and applications.
[3] Nidhi Grover.‘Big Data’- Architecture, Issues, Opportunities and Challenges. International
Journal of Computer & Electronics Research, 01 March 2014, Vol. 3(1), pp. 26-31.
[4] Frank Z. Wang, Theo Dimitrakos, Na Helian, et al. DIANA: Data Interface All-iN-A-place
for Big Data. 2014 IEEE 13th International Conference on Trust, Security and Privacy in Computing
and Communications, 2014(9):665-672. IEEE Digital Library, 2014-09-24.
[5] George Suciu, Victor Suciu, Alexandru Martian, Razvan Craciunescu, Alexandru Vulpe, Ioana
Marcu, Simona Halunga, Octavian Fratu. Big Data, Internet of Things and Cloud Convergence – An
Architecture for Secure E-Health Applications. J Med Syst (2015) 39: 141. DOI
10.1007/s10916-015-0327-y
[6] Nikos Bikakis, Timos Sellis. Exploration and Visualization in the Web of Big Linked Data: A
Survey of the State of the Art. LWDM '16, March 15, 2016, Bordeaux, France: 1-8.
[7] Maksim Tsvetovat, Alexander Kouznetsov. Social Network Analysis for Startups. O'Reilly
Media, Inc., 2012.
[8] Caesar Wu, Rajkumar Buyya, Kotagiri Ramamohanarao. Big Data Analytics = Machine
Learning + Cloud Computing. Book chapter in "Big Data: Principles and Paradigms", R. Buyya,
R. Calheiros, and A. Dastjerdi (eds), Morgan Kaufmann, Burlington, Massachusetts, USA, 2016.
[9] S. K. Divakar Mysore and S. Jain. Big Data Architecture and Patterns, Part 1: Introduction to Big
Data Classification and Architecture. IBM Big Data and Analytics, Technical Library, 2013.
[10] Eugene Ch'ng. The Value of Using Big Data Technologies in Computational Social Science.
BigDataScience '14: Proceedings of the 2014 International Conference on Big Data Science and
Computing, August 2014.
[11] Divyakant Agrawal, Sudipto Das, Amr El Abbadi. Big data and cloud computing: current state and
future opportunities. EDBT 2011, March 22–24, 2011, Uppsala, Sweden.
[12] NIST Big Data Public Working Group, Security and Privacy Subgroup. NIST Big Data
Interoperability Framework: Volume 4, Security and Privacy.
http://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.1500-4.pdf.
[13] Sourav Mazumder, S. Yu, S. Guo (eds). Chapter 2 Big Data Tools and Platforms. Big Data
Concepts, Theories, and Applications, 2013:pp29-30. DOI 10.1007/978-3-319-27763-9_2
[14] http://www.xueliedu.com/a/xinwenzixun/2015/0114/262117.html
[15] T. White. Hadoop: The Definitive Guide, 3rd ed. O'Reilly, Sebastopol, CA, 2012.
[16] Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia. Learning Spark. O'Reilly
Media, Inc., 2015.
[17] http://www.cloudera.com/products.html
[18] http://hortonworks.com/products/hdp/
[19] Muhammad Bilal, Lukumon O. Oyedele, Olugbenga O. Akinade, Saheed O. Ajayi, Hafiz A.
Alaka, Hakeem A. Owolabi, Junaid Qadir, Maruf Pasha, Sururah A. Bello. Big data architecture for
construction waste analytics (CWA): A conceptual framework. Journal of Building Engineering, In
Press, Accepted Manuscript, Available online 8 March 2016:1-38.
[20] Nawsher Khan, Ibrar Yaqoob, Ibrahim Abaker Targio Hashem, et al. Big data: survey,
technologies, opportunities, and challenges. Hindawi Publishing Corporation, The Scientific World
Journal, Volume 2014: 1-18. http://dx.doi.org/10.1155/2014/712826.