arXiv:2009.00858v1 [cs.CR] 2 Sep 2020



    A Survey on Blockchain for Big Data: Approaches, Opportunities, and Future Directions

    Deepa N, Quoc-Viet Pham, Dinh C. Nguyen, Sweta Bhattacharya, B. Prabadevi, Thippa Reddy Gadekallu, Praveen Kumar Reddy Maddikunta, Fang Fang, Pubudu N. Pathirana

    Abstract—Big data has generated strong interest in various scientific and engineering domains over the last few years. Despite its many advantages and applications, big data still poses many challenges that must be tackled for better quality of service, e.g., big data analytics, big data management, and big data privacy and security. Blockchain, with its decentralized and secure nature, has great potential to improve big data services and applications. In this article, we provide a comprehensive survey on blockchain for big data, focusing on up-to-date approaches, opportunities, and future directions. First, we present a brief overview of blockchain and big data as well as the motivation behind their integration. Next, we survey various blockchain services for big data, including blockchain for secure big data acquisition, data storage, data analytics, and data privacy preservation. Then, we review the state-of-the-art studies on the use of blockchain for big data applications in different vertical domains such as smart city, smart healthcare, smart transportation, and smart grid. For a better understanding, some representative blockchain-big data projects are also presented and analyzed. Finally, challenges and future directions are discussed to further drive research in this promising area.

    Index Terms—Blockchain, Big Data, Vertical Applications, Smart City, Smart Healthcare, Smart Transportation, Security.

    I. INTRODUCTION

    Global data traffic has increased at an unprecedented rate over the last decade, hence the special interest in "big data". As reported in [1], the big data market is expected to reach 229.4 billion USD in 2025 and to significantly reduce expenditure in various vertical industries such as healthcare, retail, transportation and logistics, manufacturing, and media and entertainment. Despite the lack of a precise definition, attention to big data can be seen in many scientific and engineering areas, e.g., computer vision, Internet of Things (IoT) data analytics, operation management, and smart cities. Adding to the structural embodiment, [2] considered big data from three aspects: attributive, comparative, and architectural. According to [3], big data can be identified as a new generation of technologies and architectures investigated to analyze a large amount of data and capture its main characteristics (e.g., high velocity, knowledge discovery, and analytics). The comparative aspect considers big data as datasets that have a very large size and dimensionality and cannot be captured, stored, managed, and analyzed by conventional database tools [4]. From the architectural viewpoint, big data is identified as datasets that have very large volume, velocity, and representation, and that require significant horizontal scaling for efficient processing [5].

    Deepa N, Sweta Bhattacharya, B. Prabadevi, Thippa Reddy Gadekallu, and Praveen Kumar Reddy Maddikunta are with the School of Information Technology and Engineering, Vellore Institute of Technology, Vellore 632014, India (e-mail: {deepa.rajesh, sweta.b, prabadevi.b, thippareddy.g, praveenkumarreddy}@vit.ac.in).

    Quoc-Viet Pham is with the Research Institute of Computer, Information and Communication, Pusan National University, Busan 46241, Korea (e-mail: [email protected]).

    Dinh C. Nguyen and Pubudu N. Pathirana are with the School of Engineering, Deakin University, Waurn Ponds, Australia (e-mail: [email protected], [email protected]).

    Fang Fang is with the Department of Engineering, Durham University, Durham DH1 3LE, UK (e-mail: [email protected]).

    This work was supported by a National Research Foundation of Korea (NRF) Grant funded by the Korean Government (MSIT) under Grants NRF-2019R1C1C1006143 and NRF-2019R1I1A3A01060518. Quoc-Viet Pham is the corresponding author.

    Nevertheless, there are various challenges and issues associated with big data techniques and applications, for example, data security and privacy, energy management, scalability of computing infrastructure, data management, data interpretation, real-time data processing, and big data intelligence. Among these challenges, security and privacy are considered especially important, since big data often involves different types of sensitive personal information, e.g., age, addresses, personal preferences, and banking details. Various solutions and techniques have been investigated to preserve data confidentiality and private information. An example is [6], where matching theory and a coalitional game were jointly utilized to optimize a resource allocation problem so as to secure mobile social networks with big data. The use of reinforcement learning was investigated in [7] to design a security-aware algorithm for a smart grid system. Recently, blockchain as a ledger technology has emerged as an attractive solution for providing security and privacy in big data systems. For example, it was shown in [8] that blockchain can play a vital role in providing high-quality data and securing data sharing for industrial IoT applications. In [9], a blockchain-based mechanism was proposed for securing data collection in mobile ad hoc networks and incentivizing mobile nodes for efficient data collection. Furthermore, blockchain was also integrated with edge computing servers to enhance data quality and to process the compute-intensive tasks requested by IoT devices with security guarantees [10]. With its unique advantages, blockchain has great potential to transform current big data systems by providing efficient security features and network management capabilities for enabling newly emerging big data services and applications. In this survey, we present a comprehensive review of blockchain for big data, ranging from approaches to opportunities and future directions.

    arXiv:2009.00858v2 [cs.CR] 5 Feb 2021


    A. State of the Art and Our Contributions

    Due to the importance of blockchain and big data, a number of surveys on related topics have been published over the past few years. One of the earliest surveys on blockchain was carried out in [11]. Privacy and security issues of blockchain systems were reviewed in [12], [13], [14], [15]. The survey in [16] presented applications of game theory to blockchain systems (e.g., games for mining management, games for security/privacy issues, and games for blockchain applications). Various surveys have been conducted to study applications of blockchain to other technologies. For example, the possibility of utilizing blockchain for IoT systems is discussed in [14], [17], [18], [19]. The integration of blockchain with edge computing and with 5G systems was studied in [20] and [21], respectively. The surveys in [22], [23] reviewed applications and opportunities of blockchain for smart grid networks. Moreover, several surveys have been dedicated to reviewing the fundamentals and applications of big data analytics. A survey on techniques and technologies for big data management was presented in [24]. Recent studies in [25], [26] reviewed and discussed the roles and applications of big data for IoT systems and smart cities. Big data analytics has also found applications in smart grid and intelligent transportation systems, and representative surveys can be found in [27], [28]. The concept of mobile big data was reviewed in [29] and has recently found many applications in next-generation wireless systems (e.g., 5G, beyond 5G, and 6G), from the physical and MAC layers to the application layer [30], [31].

    In spite of many research efforts, we are not aware of any survey that comprehensively studies the applicability of blockchain for big data applications. Although the survey in [32] reviews blockchain for big data applications and challenges, it is very short and out of date, since it was published several years ago. The survey in [33] mainly reviews the use of blockchain to address security issues in edge computing-based IoT applications. Other surveys in [18], [21], [34] also mention the interplay between blockchain and big data, but they only provide brief introductions to this topic without an in-depth survey, unlike our paper. Motivated by the above observations, we provide a comprehensive survey on blockchain for big data, which covers fundamental knowledge, up-to-date approaches, opportunities, research challenges, issues, and future directions. The key objective of this survey is to inspect the state-of-the-art studies and to review the applicability of blockchain for big data applications. In summary, the contributions and features offered by this work are as follows.

    • Firstly, we present an overview of blockchain and big data as well as the motivations behind the use of blockchain for big data. We show that blockchain has great potential for facilitating big data analytics through control of dirty data, enhanced security and privacy, enhanced quality of data, and the management of data sharing.

    • Secondly, we review four main blockchain services for big data, including blockchain for secure data acquisition, blockchain for secure data storage, blockchain for data analytics, and blockchain for data privacy preservation.

    • Thirdly, we provide an extensive discussion of the use of blockchain in several popular big data applications, including smart healthcare, transportation and logistics, smart grid, and smart cities. Moreover, some popular blockchain-based big data projects are also introduced and analyzed.

    • Finally, we discuss a number of research challenges that arise from the state-of-the-art survey on the use of blockchain for big data. We also highlight open research opportunities that provide a roadmap for future research.

    B. The Survey Organization

    The structure of this survey is shown in Fig. 1. An overview of blockchain and big data is presented in Section II, along with a discussion of the motivations of their integration. The main parts of this survey are given in Sections III and IV, which respectively present 1) blockchain services for big data and 2) blockchain-big data applications and projects. Section V discusses and highlights a number of research challenges, issues, and future directions. Finally, Section VI concludes the article.

    [Figure: block diagram of the article structure. Section I: Introduction. Section II: Blockchain and Big Data: An Overview (Blockchain; Big Data; Motivations of the Blockchain-Big Data Integration). Section III: Blockchain Services for Big Data (Acquisition; Storage; Analytics; Privacy Preservation). Section IV: Blockchain Big Data Applications and Projects (Smart Healthcare; Smart Transportation; Smart Grid; Smart City; Projects). Section V: Research Challenges and Future Directions. Section VI: Conclusions.]

    Fig. 1. Organization of this article.

    II. BLOCKCHAIN AND BIG DATA: AN OVERVIEW

    This section presents the background and recent developments of blockchain and big data, and the motivations of their integration.

    A. Blockchain

    Blockchain is presently one of the most prevalent disruptive technologies, and it is paving the way for emerging financial and industrial services [35], [36]. Conceptually, it consists of a list of records, commonly known as blocks, wherein the stored information is encrypted to ensure privacy and security. Also, unlike other technologies, blockchain is a decentralized network wherein the participating members have complete authority to monitor all transactions in the blockchain network in a peer-to-peer (P2P) manner [37], [38]. Blockchain technology is an amalgamation of multidisciplinary concepts such as software engineering, cryptography, and distributed computing, creating an infrastructure that emphasizes the security of digital assets. The combination of these concepts is commonly termed cryptoeconomics, which creates robust P2P networks for facilitating the use and transfer of assets among computers in digital markets [39]. Cryptocurrency can be considered the present and future mode of financial transactions, supporting the aforementioned transparency and security aspects of blockchain technology. In the digital market, cryptocurrencies exist in different forms such as Bitcoin, Ethereum, Litecoin, Stellar, Ripple, Zcash, Dash, etc. [40], [41].

    Bitcoin, the most popular cryptocurrency platform, was introduced by Satoshi Nakamoto, and since then almost 1600 cryptocurrencies have evolved using the Bitcoin concept [42]. In the case of Bitcoin, whenever a sender initiates a transaction, it reaches the receiver through the public Bitcoin network. Verification is conducted by miners in the network, who also ensure that the sender has the necessary number of bitcoins to send to the receiver without affecting the basic sanity of the network. After approval and verification by a miner, the transaction is added to a block, which eventually becomes part of the blockchain. Finally, the transactions in the block are executed, thereby updating the ledgers across all nodes so that all participants share the same copy of the transaction history, ensuring transparency and security [43].
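The transaction flow described above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not Bitcoin's actual implementation: it assumes an account-balance ledger (Bitcoin itself tracks unspent transaction outputs), omits signatures and proof of work, and the names `Transaction`, `validate`, and `mine_block` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    sender: str
    receiver: str
    amount: int  # smallest currency unit, to avoid float rounding


def validate(tx, balances):
    """Miner-side check: positive amount and sufficient sender funds."""
    return tx.amount > 0 and balances.get(tx.sender, 0) >= tx.amount


def mine_block(pending, balances):
    """Validate pending transactions in order and apply the accepted ones.

    Returns the transactions included in the new block. `balances` is
    updated in place and stands in for the ledger copy held by every node
    (an assumption of this sketch; real nodes each update their own copy).
    """
    block = []
    for tx in pending:
        if validate(tx, balances):
            balances[tx.sender] -= tx.amount
            balances[tx.receiver] = balances.get(tx.receiver, 0) + tx.amount
            block.append(tx)
    return block


ledger = {"alice": 50, "bob": 10}
pending = [
    Transaction("alice", "bob", 30),
    Transaction("bob", "carol", 45),   # rejected: bob holds only 40 at this point
    Transaction("alice", "carol", 20),
]
block = mine_block(pending, ledger)
print(len(block), ledger)
```

The second transaction is dropped because the sender's balance is insufficient when it is checked, which mirrors the miner's role of confirming that the sender actually owns the coins being spent.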

    Blockchain platforms can be classified into three types based on their areas of application: public, private, and hybrid blockchains [44]. A public blockchain does not have a single owner and is visible to everyone in the network. Bitcoin is an example of a public blockchain; it is decentralized, and its consensus process is open to all participants in the network. A private blockchain, on the contrary, is permissioned and controls which network members may read from and write to the blockchain. In hybrid blockchains, public access is given only to a specific group. A hybrid blockchain is a partially decentralized framework where the consensus process is guided by rules agreed among all parties regarding control of and access to the blockchain [45]. Some of the most important features of blockchain are as follows:

    1) Immutability: Blockchain is almost impossible to corrupt because its records are permanent and unalterable. It works differently from the traditional banking system: it uses a collection of nodes, and each node in the system has a copy of the digital ledger [46]. When a transaction is initiated, the nodes check its validity and authenticate it before adding it to the ledger. Hence, the success of any transaction depends on consensus across a majority of nodes, which makes the framework transparent and secure. It also reduces the chances of corruption, which is especially evident in a public blockchain that allows everyone to see the transactions but does not allow altering the data stored in the blockchain [47], [48].

    2) Decentralization: The network is not governed by a single authority but by a group of nodes that are responsible for maintaining the network. This decentralized approach allows participants to access the blockchain from the web and store their replicated information using private keys [49].

    3) Security: Blockchain, with its decentralized and immutable nature, can provide a high degree of security [35], [50]. The use of cryptography involves complex algorithms that act as firewalls against unauthorized attacks. Each piece of information is hashed, which hides its actual content and also provides a unique identifier for each data item. In the chain, each block in the ledger holds its own hash and also the hash of its previous block, which makes it very hard to tamper with the data. Hashing is also irreversible. Likewise, it is infeasible to derive a private key from a public key, and corrupting the network would basically mean changing the data stored on each node in the network [51].

    4) Consensus: The operation of blockchain frameworks relies on the associated consensus algorithms, which are responsible for deciding the group of active nodes on the network. This makes the validation process for a transaction faster and similar to a voting system [52].

    5) Accelerated Financial Settlement: Blockchain transactions are processed much faster than in traditional banking systems. This technology enables faster transfer of money to foreign workers and overseas travelers. Smart contracts running on the blockchain also help ensure faster settlement of contractual accounts [53].
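The hash linking described under the Security feature (each block stores its own hash and the hash of its predecessor) can be illustrated with a minimal Python sketch using the standard `hashlib` module. The block layout, the all-zero placeholder for the first block, and the function names are assumptions made for illustration; production blockchains add timestamps, Merkle roots, and consensus data.

```python
import hashlib
import json
from typing import Optional

GENESIS_PREV = "0" * 64  # placeholder "previous hash" for the first block


def block_hash(index: int, data: str, prev_hash: str) -> str:
    """SHA-256 over a canonical JSON encoding of the block contents."""
    payload = json.dumps({"index": index, "data": data, "prev": prev_hash}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def build_chain(records):
    """Link records into a list of blocks, each storing its predecessor's hash."""
    chain, prev = [], GENESIS_PREV
    for index, data in enumerate(records):
        digest = block_hash(index, data, prev)
        chain.append({"index": index, "data": data, "prev": prev, "hash": digest})
        prev = digest
    return chain


def first_invalid(chain) -> Optional[int]:
    """Return the index of the first inconsistent block, or None if the chain is intact."""
    prev = GENESIS_PREV
    for block in chain:
        recomputed = block_hash(block["index"], block["data"], block["prev"])
        if block["prev"] != prev or block["hash"] != recomputed:
            return block["index"]
        prev = block["hash"]
    return None


chain = build_chain(["sensor reading 1", "sensor reading 2", "sensor reading 3"])
intact = first_invalid(chain)

# Altering stored data without touching the hash is caught at that block.
chain[1]["data"] = "forged reading"
naive_tamper = first_invalid(chain)

# Recomputing the forged block's hash only moves the mismatch to its successor.
chain[1]["hash"] = block_hash(1, chain[1]["data"], chain[1]["prev"])
rehashed_tamper = first_invalid(chain)
```

To go undetected, an attacker would have to rewrite every later block as well, and do so on the copies held by the other nodes, which is the practical meaning of immutability here.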

    B. Big Data

    Big data is typically characterized by the 4-V features: volume, velocity, variety, and veracity [54], [55]. Here, we briefly describe these features of big data.

    1) Volume: Volume simply means the quantity of data, which determines whether or not a dataset is considered big data. Regarding big data processing, one usually faces several challenges, which may include the curse of modularity (i.e., the complete data cannot be stored or loaded in memory or on hard disk), the curse of class imbalance (i.e., there may exist different data distributions), and the curse of dimensionality (i.e., the dataset has many features and attributes) [56]. Moreover, data non-linearity, variance and bias, and computing availability are also considered challenges associated with the volume feature of big data.

    2) Variety: Variety represents the various types of data, such as video, text, and audio, which are generally composed of structured, semi-structured, and unstructured data. The major challenges caused by variety may include data locality, data heterogeneity, and dirty and noisy data [57]. Here, data locality means that the complete data cannot be stored in a single data center and is typically distributed over a large number of physical locations. Data heterogeneity refers to the various heterogeneous sources of data, which have different data types, formats, models, and semantics. Dirty and noisy data means that the data can contain noise and errors caused by data collection methods, data sources, and generation time.

    3) Velocity: Velocity refers to the generation speed of data, i.e., how fast the data is generated to meet the demand. The number of mobile devices is projected to reach 13.1 billion in 2023, up from 8.8 billion in 2018, and these devices can generate an enormous amount of traffic [58]. Other good examples of the unprecedented growth of data are high-definition videos, video gaming, and streaming platforms (e.g., YouTube and IBM Cloud Video). In some literature, this feature is also considered as variability, that is, different applications may have different rates of data flow [26]. For example, a vehicular crowdsensing system may generate more data in peak hours due to the participation of a large number of vehicles on the road.

    4) Veracity: Veracity refers to the quality of the data, since data can be collected from multiple sources that may include low-quality and noisy samples. This is to be expected, since data can be generated by malfunctioning or uncalibrated IoT devices or by untrusted devices, and can be transmitted to the data center via fading and dynamic wireless environments [59]. To improve the quality and analytical accuracy of big data, the challenges of data provenance, uncertainty, and dirty and noisy data should be effectively tackled.
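As a quick worked example for the Velocity figures above, the growth from 8.8 billion mobile devices in 2018 to 13.1 billion in 2023 can be expressed as a compound annual growth rate. The helper name `cagr` is ours, and the calculation assumes steady year-over-year growth, which the source does not claim.

```python
def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate: the constant yearly rate that turns start into end."""
    return (end / start) ** (1 / years) - 1


# Device counts in billions, as quoted from [58] in the text.
rate = cagr(8.8, 13.1, 2023 - 2018)
print(f"Implied growth: {rate:.1%} per year")
```

Under that assumption the device count grows by roughly 8% per year, which gives a sense of how quickly the traffic sources behind the velocity feature multiply.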

    Big data analytics is about extracting useful information and patterns from the dataset, which are then used for different purposes and to create business and social value. In the literature, this is usually considered the fifth feature of big data, namely value. Big data has found applications in many vertical domains such as smart grid, mobile and e-health, transportation and logistics, and wireless and communication networking. Besides these great opportunities, big data faces a number of technological challenges and issues, for example, big data management, data cleansing, imbalanced system capacities, imbalanced data, data analytics, and learning from data [56]. For more details, we refer the interested reader to the survey in [56] and the references therein.

    C. Motivations of the Blockchain and Big Data Integration

    Governments and private organizations are investing heavily in big data and blockchain technologies due to their great potential in solving many real-world problems. In modern life, customers are more inclined to conduct transactions online, and a growing amount of data is being generated every day. This exponential rise in digital data creates new opportunities for industries to understand customer needs, purchasing patterns, and trends. Big data analytics, which uses data mining and statistical models to analyze massive datasets, is playing a major role in helping industries gain insights into the purchase patterns of customers [60]. However, the tremendous growth of big data presents its own challenges. Some of the key challenges of big data are security and privacy issues, dirty data, reliability of the data sources, sharing of the data, etc. [61]. These challenges can be addressed by the unique properties of blockchain, such as decentralized storage, immutability, transparency, and consensus mechanisms. The motivations for integrating blockchain with big data are discussed as follows.

    • Improving Big Data Security and Privacy: As the number of devices connected to the Internet grows day by day, the quantity of data stored at third-party locations such as the cloud is increasing rapidly. This brings new challenges such as data breaches or threats caused by curious third parties [62]. Traditional security solutions such as firewalls cannot address this issue, since organizations have no control over data that is not stored within their network perimeter. Using blockchain to store big data has the potential to address this issue. The encrypted and decentralized storage of the data in the blockchain network makes unauthorized access to the data very difficult.

    • Improving Data Integrity: People may tamper with the records in big data to influence the predictions of big data analytics in their favor. The immutability property of blockchain ensures that it is next to impossible to tamper with the data stored in the blockchain network. Anyone who wants to modify the data in the blockchain network has to modify the data on a majority (more than 50%) of the nodes in the network, which is nearly impossible in practice. Also, the immutability property of blockchain ensures that data stored in the blockchain network is reliable.

    • Fraud Prevention: Existing big data solutions rely on the analysis of patterns in historical data to detect fraudulent transactions. Hence, big data analytics by itself cannot fully solve the problem of fraudulent transactions in the financial sector. Storing big data in blockchain enables financial institutions to monitor each transaction in real time, allowing them to assess potentially fraudulent transactions on the fly. As a result, the integration of blockchain with big data can help financial institutions prevent fraud and protect their customers.

    • Real-Time Data Analytics: Since blockchain stores every transaction, it makes real-time analytics of big data achievable. With blockchain-integrated big data analytics, banks and financial institutions can settle cross-border transactions, including those involving large amounts, in near real time. Also, banks can monitor changes in the data in real time, enabling them to make decisions such as blocking transactions in real time.

    • Enhancement of Data Sharing: The integration of blockchain with big data helps service providers share data with other stakeholders with minimal risk of data leakage. Also, if the big data generated from different sources is stored in blockchain, repeated analysis of the same data can be avoided, as each experiment carried out is recorded in the blockchain.

    • Enhancement of the Quality of Big Data: Data scientists spend most of their time on data integration, as different sources follow different formats in data collection. By using blockchain for data storage, the quality of the data can be improved, as the stored data is structured and complete. Hence, data scientists can work on quality data to come up with more accurate predictions in real time.

    • Streamlining the Data Access: The use of blockchain would simplify the life cycle of big data analytics by streamlining data access. Indeed, by involving multiple departments of an organization in a common blockchain, authorized users can get access to secure, trusted data without having to go through several checks.
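The integrity argument above, that an attacker must alter the copies held by a majority of nodes, can be made concrete with a small Python sketch. It compares the digests of a record as stored on several replicas and accepts the version held by a strict majority. This is a simplification under stated assumptions: real networks weigh hash power or stake rather than counting nodes, and `majority_view` and the node names are hypothetical.

```python
import hashlib
from collections import Counter


def digest(record: str) -> str:
    """SHA-256 digest used to compare replica contents."""
    return hashlib.sha256(record.encode("utf-8")).hexdigest()


def majority_view(replicas):
    """Return (record, dissenting_nodes); record is None without a strict majority."""
    digests = {node: digest(record) for node, record in replicas.items()}
    top, votes = Counter(digests.values()).most_common(1)[0]
    if 2 * votes <= len(replicas):
        return None, sorted(replicas)
    record = next(r for n, r in replicas.items() if digests[n] == top)
    dissent = sorted(n for n in replicas if digests[n] != top)
    return record, dissent


honest = {f"n{i}": "balance=10" for i in range(1, 6)}

# Two of five copies altered: the honest value still wins and the forgers stand out.
minority_attack = dict(honest, n4="balance=99", n5="balance=99")
accepted, dissent = majority_view(minority_attack)

# Three of five copies altered: the forged value is now what the network accepts.
majority_attack = dict(minority_attack, n3="balance=99")
forged, honest_left = majority_view(majority_attack)
```

The second case shows why the majority threshold matters: tampering with a minority of replicas is detectable, whereas controlling more than half of them is what it takes to rewrite the shared record.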

    III. BLOCKCHAIN SERVICES FOR BIG DATA

    Big data technology has grown tremendously as large corporations and organizations use advanced analytical tools to store, visualize, and analyze data. However, due to the enormous volume of data utilization and transmission, big data security is a major challenge. Cloud computing has been widely used for big data services despite some security concerns. When proper security mechanisms are not used, third-party applications and intruders can easily perform malicious activities such as stealing sensitive data or crashing the server [63]. Big data faces challenges from a variety of perspectives, such as data collection, data sharing, data storage, and data analysis. In this section, we survey the blockchain-based approaches and services for big data. An overview of blockchain services in the big data environment, namely big data acquisition, big data storage, big data analytics, and big data privacy preservation, is depicted in Fig. 2.

    A. Blockchain for Big Data Acquisition

    In general, big data applications acquire data from diversified sources in different formats (often unstructured). These data cannot be processed in their native form. Therefore, the data must be converted to a structured format from which various predictions on the application domain can be made. Blockchain, with its capability of handling vast data effectively, provides structured data for making predictions. Blockchain ensures data integrity through consensus algorithms, thereby mitigating data attacks. Here, we analyze two subdomains in blockchain services for big data acquisition: blockchain for secure big data collection and blockchain for secure big data transmission/sharing.

    1) Blockchain for Secure Big Data Collection: Big data applications have gained popularity but face major security issues and challenges. Data collection is a very important task in the life cycle of data processing. Untrusted data sources and communication links expose data collection to various malicious attacks and threats. Therefore, a secure data collection methodology is vital for various data applications. Several research works have addressed secure data collection. For example, a secure big data collection scheme based on blockchain is introduced for mobile crowdsensing (MCS) [64]. Due to the rapid growth of portable smart devices such as mobile terminals (MTs) and sensors, MCS has been efficiently applied in Industrial Internet of Things (IIoT) environments. An MCS framework is developed with cloud servers and a set of MTs. The MCS servers publish a set of sensing tasks and choose MTs in a particular area to complete the tasks. The main challenges in performing data collection are the limited energy resources of MTs, the range of sensing devices, and secure data sharing between MTs. A framework was proposed in [65] to overcome these challenges using blockchain and deep reinforcement learning (DRL). It provides energy-efficient data collection and secure data sharing in a distributed environment. The distributed blockchain-based DRL approach for each MT provides extensive data collection and maximum range for sensing devices. An Ethereum blockchain platform is used to provide data reliability and security while MTs share the data. Ethereum maintains a secure ledger and shares it with the cooperating MTs without a trusted third party. The proposed framework provides solutions for various attacks such as the majority attack, device failure, and the eclipse attack [65].

    2) Blockchain for Secure Big Data Transmission/Sharing: Blockchain, with its decentralized and immutable nature, is able to provide secure big data transmissions. It also supports reliable data sharing from data sources to data analytics, aiming to solve the security and privacy issues that remain in traditional data transmission protocols. Blockchain can safeguard big data training and prevent data theft to facilitate big data transmissions. Data can be recorded from ubiquitous sources such as data reports, data libraries, social media, or assistive gadgets. They are then added to the blockchain with signature and hash values before being shared with data analytic services. Both data source owners and data analytic users can trace and monitor the data sharing flow over the network, which in turn provides high transparency and reliable data sharing. An example of a big data transmission model with blockchain is illustrated in Fig. 3.
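
    To make the "signature and hash values" step concrete, the following is a minimal Python sketch of how a data record might be wrapped before sharing and checked by a receiving analytics service. It is an illustration under stated assumptions, not a scheme taken from the surveyed works: the function names and the key are hypothetical, and an HMAC (a symmetric tag) stands in for a real public-key digital signature.

```python
import hashlib
import hmac
import json


def make_record(payload: dict, owner_key: bytes) -> dict:
    """Wrap a payload with its SHA-256 digest and an HMAC tag (hypothetical format)."""
    body = json.dumps(payload, sort_keys=True).encode()
    return {
        "payload": payload,
        "hash": hashlib.sha256(body).hexdigest(),
        # Assumption: HMAC is a symmetric stand-in for a digital signature.
        "signature": hmac.new(owner_key, body, hashlib.sha256).hexdigest(),
    }


def verify_record(record: dict, owner_key: bytes) -> bool:
    """Recompute the digest and tag from the payload and compare with the stored ones."""
    body = json.dumps(record["payload"], sort_keys=True).encode()
    expected_hash = hashlib.sha256(body).hexdigest()
    expected_sig = hmac.new(owner_key, body, hashlib.sha256).hexdigest()
    return (hmac.compare_digest(record["hash"], expected_hash)
            and hmac.compare_digest(record["signature"], expected_sig))


key = b"hypothetical-owner-key"
record = make_record({"source": "social_media", "value": 42}, key)
ok_before = verify_record(record, key)   # record is intact
record["payload"]["value"] = 43          # simulate tampering in transit
ok_after = verify_record(record, key)    # verification now fails
```

    In a deployment, the tag would be a signature that any party can verify with the owner's public key, and the record (or only its hash) would be written to the ledger so that the sharing flow can be traced.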

    In the literature, some research efforts are devoted to the use of blockchain for supporting big data transmission and sharing. The emergence of edge computing has brought an increasingly vast amount of data to edge nodes, allowing end users to optimize latency and processing time. However, sharing sensitive information without proper authorization is a challenging task. The work in [66] introduces a blockchain model to share reliable data at the edge node. In this process, the authors focus on reducing the computation at the edge nodes using proof-of-collaboration. Besides, to reduce response time and storage overhead, the authors introduce a blockchain-based futile transaction filter algorithm that accesses data from the cache layer rather than the storage layer. Finally, the authors propose express transactions and hollow blocks to increase the efficiency of the network in the proposed model. Express transactions with smart contracts are developed to support the validation of transactions that do not occur in the same time period.


    [Figure 2 labels: Edge Node; Data Collection (Smart Hospital, Smart Home, Smart City, Smart Grid); Data Acquisition; IoT Devices; Cloud Server; Data Storage (HDFS, IPFS, EduRSS); Big Data as a Service; Blockchain Mining; Smart Contract; Data Retrieval; Data Mining; Artificial Intelligence; Learning; Data Training; Predictive Analysis.]

    Fig. 2. An overview of blockchain services in big data environment.

    [Figure 3 labels: Assistive Gadgets; Data Library; Data Reports; Social Media; Smart IoT Systems; Big Data Storage; Unstructured Data; Structured Data; Blockchain Ledger; Add a block to the chain after reuse; Data Verification using Consensus algorithm; Prevents data theft; Reuse-Data trading; Direct Data Transmission; Secure Data Transmission; Real Time Data Analytics; Predict data trends; Predict behavior; Predict Pattern; Data visualization; Data quality management; AI for predictive analytics.]

    Fig. 3. Secured blockchain services for big data processing.

    Also, hollow blocks help to reduce the redundancy that occurs in block generation, thereby increasing the resource efficiency of the network. The experimental results demonstrate that the proposed model saves 90% of computing resources, 95% of storage resources, and 27% of network resources.

    The immense growth of cyber-physical systems helps in providing faster information services and real-time sensing. Big data sends information to the cyber-physical system, which utilizes the radio spectrum. Competition in spectrum auctions is intense, and license-free spectrum access is restricted. In [67], the authors propose a blockchain-based solution for license-free spectrum access using smart contracts, which facilitates transferring non-real-time data in a secure manner. The proposed framework uses edge nodes to reduce latency and provides a blockchain-based protocol that secures the transaction process: multiple channels are created for the spectrum, and each channel is allocated a dedicated blockchain. During the process, two kinds of blocks are created, namely the key block and the micro block. The key block selects an efficient spectrum license holder, and the micro block maintains all the transaction details. Finally, the valid and authorized node gets the spectrum license from the key block and keeps the license until the key block identifies the next holder. However, we found that the protocol uses PoS-after-PoW for generating the key block to select a user, which incurs a high computation cost and consumes more resources.


    B. Blockchain for Big Data Storage

    1) Blockchain for Secure File Systems: Several cloud-based services are available to store and access files from anywhere on any machine. Users, particularly organizations, are hesitant to store sensitive information on a system managed by a third party. Although encrypting files before storing them in the cloud is one solution, cloud providers still face security challenges. At present, electronic information systems are widely used in medical treatment. Large volumes of data, such as medical images, medical records, and diagnostic reports, are produced every day. Electronic medical information can affect the treatment of patients, and patient experience is shared with other medical academies. If the shared patient medical data is misused, the privacy of the patient is compromised. Some mechanisms should therefore be adopted to control access to medical data. Blockchain integrated with the InterPlanetary File System (IPFS) provides a solution for these kinds of security problems. IPFS is a decentralized storage platform developed to address the issue of file redundancy. It defines a unique hash value for the stored file, and the user can find the file based on the hash address. An attribute-based encryption method is applied to the medical data before it is stored in cloud storage. The private key of the user is associated with the user's attributes, and the ciphertext is associated with an access policy. A user can decrypt the ciphertext if the user's private key satisfies the access policy embedded in the ciphertext. Also, blockchain is used to record the data storage and retrieval process. The hash value of the stored medical data is kept in the blockchain to provide evidence of authenticity for user verification. The decentralized blockchain framework helps to provide security for the file storage and avoids a single point of failure [68].
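
    The interplay of hash-based addressing, access policies, and on-chain logging described above can be sketched as follows. This is a simplified Python illustration, not the scheme of [68]: the class and function names are hypothetical, a plain SHA-256 hex digest stands in for IPFS content identifiers, a Python list stands in for the blockchain, and the attribute check is an ordinary set comparison rather than real attribute-based encryption, which enforces the policy cryptographically.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: a file's address is the SHA-256 of its bytes."""

    def __init__(self):
        self._blobs = {}

    def add(self, data: bytes) -> str:
        address = hashlib.sha256(data).hexdigest()
        self._blobs[address] = data  # identical content maps to a single entry
        return address

    def get(self, address: str) -> bytes:
        return self._blobs[address]

    def __len__(self) -> int:
        return len(self._blobs)


store = ContentStore()
ledger = []  # stand-in for on-chain records of storage and retrieval events


def upload(data: bytes, policy: set) -> str:
    """Store the data and log its address and access policy."""
    address = store.add(data)
    ledger.append({"event": "store", "address": address, "policy": sorted(policy)})
    return address


def download(address: str, attributes: set) -> bytes:
    """Return the data only if the caller's attributes satisfy the logged policy."""
    entry = next(e for e in ledger
                 if e["event"] == "store" and e["address"] == address)
    # Hypothetical AND-only policy: the user must hold every required attribute.
    if not set(entry["policy"]) <= attributes:
        ledger.append({"event": "denied", "address": address})
        raise PermissionError("attributes do not satisfy the access policy")
    data = store.get(address)
    # Integrity check: the stored bytes must still hash to the logged address.
    if hashlib.sha256(data).hexdigest() != address:
        raise ValueError("stored data does not match its logged hash")
    ledger.append({"event": "retrieve", "address": address})
    return data
```

    Because the address is derived from the content, uploading the same file twice yields one stored copy, which is the redundancy property the paragraph attributes to IPFS; the ledger entries provide the audit trail of storage and retrieval.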

    2) Blockchain for Secure Database Management: Data stored in various types of database management systems is vulnerable to attacks from internal and external sources. Database tampering detection methods have been used to detect malicious updates in databases. Such methods use one-way cryptographic hash functions along with digital watermarking to identify data misuse. However, they are not applicable to distributed databases. A blockchain-based solution is applied to store data in distributed databases and detect malicious user transactions. Blockchain avoids data tampering by applying time stamping. A virtual shared ledger is incorporated to store the history of transactions. All the transactions are recorded in blocks, and the blocks are linked to each other with cryptographic hash values. When the data in a block is modified by a malicious attacker, the hash value of the block changes and the block becomes invalid. A blockchain-based scheme, namely the education records secure storage and sharing scheme, is proposed for privacy-preserving and secure storage of education records. The scheme integrates storage servers, cryptographic algorithms, and blockchain to develop a safe and reliable environment. Blockchain is integrated to provide reliability and security to the education database. Smart contracts in the blockchain are applied to control the data sharing and storage process. The educational records are linked with the hash information stored in the blockchain to provide security to the stored data. Cryptographic algorithms and digital signatures are used to protect the records [69].
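
    The tamper-detection property described above (a modified block no longer matches its hash, and later blocks no longer link to it) can be illustrated with a short Python sketch. The block layout and function names are assumptions made for illustration and do not reproduce the scheme of [69].

```python
import hashlib
import json
import time

GENESIS_PREV = "0" * 64  # placeholder predecessor hash for the first block


def block_hash(block: dict) -> str:
    """Hash the block contents (everything except the stored hash itself)."""
    content = {k: block[k] for k in ("index", "timestamp", "data", "prev_hash")}
    return hashlib.sha256(json.dumps(content, sort_keys=True).encode()).hexdigest()


def append_block(chain: list, data: dict, timestamp=None) -> dict:
    """Append a time-stamped block linked to the previous block's hash."""
    block = {
        "index": len(chain),
        "timestamp": time.time() if timestamp is None else timestamp,
        "data": data,
        "prev_hash": chain[-1]["hash"] if chain else GENESIS_PREV,
    }
    block["hash"] = block_hash(block)
    chain.append(block)
    return block


def first_invalid_block(chain: list):
    """Return the index of the first block that fails verification, or None."""
    for i, block in enumerate(chain):
        if block["hash"] != block_hash(block):
            return i  # contents were changed after the block was sealed
        expected_prev = chain[i - 1]["hash"] if i else GENESIS_PREV
        if block["prev_hash"] != expected_prev:
            return i  # link to the predecessor is broken
    return None


chain = []
append_block(chain, {"student": "s1", "grade": "B"}, timestamp=1)
append_block(chain, {"student": "s2", "grade": "C"}, timestamp=2)
append_block(chain, {"student": "s3", "grade": "A"}, timestamp=3)
```

    Editing the data of the second block makes `first_invalid_block(chain)` report index 1. If the attacker also recomputes that block's hash, the check fails at index 2 instead, because the next block still stores the old predecessor hash.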

    3) Blockchain for Big Data Storage Infrastructure: In the past few years, big data has grown into a new standard that provides huge amounts of data and prospects to enhance decision-making applications in science and engineering. At the same time, it faces challenges in storing, processing, and transmitting the data. Cloud computing offers basic support to address these issues with shared resources such as networking, storage, and computing. The increased availability of data for AI provides opportunities in the healthcare industry. Present machine learning (ML) algorithms transform personal data into medical data for analytics, which prevents patients from accessing their own medical data. Blockchain can provide security solutions that motivate biomedical research and allow patients to access and control their personal data, along with the capability of monitoring their health records. Blockchain provides decentralization for transparent and secure distributed personal data [70].

    C. Blockchain for Big Data Analytics

    1) Blockchain for Secure Data Training: The development of edge and cloud computing has increased the amount of data in various scenarios. Several ML and deep learning (DL) methods are applied for effective data analysis. The support vector machine (SVM) is one of the popular ML methods, applied for its efficiency and accuracy. In vehicular social networks, data are gathered from various entities, namely social network companies, vehicle manufacturers, and vehicle management agencies. Data from different sources normally differ in their attributes. When training an SVM classifier, each entity faces the problem of data with inadequate attributes due to the diversity of sources. Therefore, the entities must share data to integrate a dataset with multiple attributes and train the classifier. Sharing data among entities raises data privacy issues. A privacy-preserving blockchain-based SVM training method was proposed for vertically partitioned datasets from various data providers. In this method, a consortium blockchain and a homomorphic cryptosystem were used to implement a secure training platform without the need for a trusted third party. The training operations are performed locally over the original data, and the interactions between the entities are secured by the homomorphic cryptosystem and the consortium blockchain. The consortium blockchain helps to build an open and secure data sharing environment for effective communication between the entities when they share attribute values [71].
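
    To illustrate why a homomorphic cryptosystem suits vertically partitioned training, the sketch below implements a toy additively homomorphic (Paillier-style) scheme in Python and uses it to combine two providers' partial scores without revealing either one. This is an assumption-laden illustration, not the protocol of [71]: the primes are far too small to be secure, the function names are hypothetical, and only non-negative integers whose sum stays below n are handled.

```python
import math
import random


def keygen(p: int = 1009, q: int = 1013):
    """Toy Paillier key generation with generator g = n + 1 (insecure key size)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse; requires Python 3.8+
    return n, (lam, mu, n)


def encrypt(n: int, m: int) -> int:
    """Encrypt m as (1 + m*n) * r^n mod n^2 with a random r coprime to n."""
    n2 = n * n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return ((1 + m * n) * pow(r, n, n2)) % n2


def decrypt(private_key, c: int) -> int:
    """Recover m from the ciphertext using L(u) = (u - 1) // n."""
    lam, mu, n = private_key
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n


def add_encrypted(n: int, c1: int, c2: int) -> int:
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return (c1 * c2) % (n * n)


public_n, private_key = keygen()

# Each provider holds a different subset of attributes (vertical partitioning)
# and computes its part of the decision value w . x locally.
partial_a = 3 * 4 + 2 * 5   # provider A: weights (3, 2), attributes (4, 5)
partial_b = 1 * 7           # provider B: weight (1,), attribute (7,)

encrypted_sum = add_encrypted(public_n,
                              encrypt(public_n, partial_a),
                              encrypt(public_n, partial_b))
total = decrypt(private_key, encrypted_sum)  # equals partial_a + partial_b
```

    Only the combined value is decrypted, so neither provider's partial score is exposed to the other. In the surveyed setting, the exchanged ciphertexts would additionally be recorded on the consortium blockchain.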

    2) Blockchain for Secure Data Learning in AI Algorithms: The extensive generation of data from sensors, social media, the web, and IoT devices has resulted in the growth of artificial intelligence (AI) techniques. The data can be fed to ML and DL algorithms for data analytics. These methods depend on a centralized server for training, which can lead to data tampering, so the decisions obtained from AI may be erroneous and risky. Decentralized AI, which integrates blockchain and AI, came into existence to solve this problem. Several limitations of blockchain and AI are solved by integrating these two technologies. AI techniques depend on data to learn and to make decisions. These techniques perform better when the input data are gathered from secure, reliable, and trusted data repositories. Blockchain provides secure environments through a distributed ledger in which data can be recorded and transacted [72]. Here, data are stored with high resiliency and integrity in the blockchain and cannot be tampered with. When smart contracts are used for learning in AI algorithms to obtain decisions and analytics, the results can be undisputed and trusted. Therefore, the integration of blockchain with AI can provide an immutable, decentralized, and secure environment for learning from highly sensitive data. This integrated framework supports substantial development in various domains such as banking, medical, financial, personal, and trading [73].

    D. Blockchain for Big Data Privacy Preservation

    1) Blockchain for Privacy Preservation in Big Data Processing: Due to the rapid increase in data generation, privacy preservation has become a main concern. In this era of big data, data is regularly gathered and examined, which leads to commercial growth and innovation. Big organizations and companies utilize the collected data to provide better customer services, optimize decision processes, and forecast future developments. Thus, data has become a valuable asset. Big data is widely applied in smart city environments for extensive monitoring of city traffic and maintenance, for ensuring the quality of air and water, and so on. A blockchain-based model is proposed in [74] for privacy preservation in intelligent transportation systems (ITS), specifically for in-car navigation systems in a smart city environment. The model applies offline blockchain-based storage in which all the sensitive information from the users is stored in a secure manner. The sensitive data is encrypted with a shared key associated with a group of cars. Users can control privacy features such as enabling or disabling the sharing of speed and location details [74], [75].

    2) Blockchain for Privacy Preservation in Big Data Storage: The big data era threatens user privacy in various digital scenarios. Third-party organizations benefit from gathering, analyzing, and managing huge amounts of users' personal information. The services provided by these third parties are prone to security breaches and data misuse without the knowledge of the users. Blockchain provides various solutions to the challenges faced by user data. User transactions in blockchain are less exposed to such privacy concerns, and users are provided with options to control their personal information, including when, to whom, and which personal information is revealed in each transaction. Privacy-preserving solutions built on crypto-privacy methods are emerging for blockchain to allow users to remain unidentified and gain control over their personal information during their digital transactions in the ledger [76]. The various services provided by the cloud environment for big data, the challenges, and the blockchain-based solutions are tabulated in Table I.

    IV. BLOCKCHAIN BIG DATA APPLICATIONS AND PROJECTS

    Blockchain technologies have gained immense momentum with their varied applications in various spheres of life. The technology is still in its infancy and is being experimented with to provide solutions to challenges pertinent to security, data ownership, decision support systems, identity verification, and decentralization. The present generation is passing through an era of overwhelming volumes of digital data generated by humans and machines. Hence, there is a pressing need to store, organize, process, and analyze this big data, where blockchain technologies have a potentially significant role to play [77]. As an example, maintaining data ownership, data transparency, and access control has always been a major challenge. Blockchain technology addresses this issue by storing access policies for personal data in the blockchain framework. Using blockchain technology, a decentralized personal data management system is created by implementing a protocol that allows users to own and manage their data. The dependency on third parties is eliminated, allowing organizations to focus more on data utilization than on security management and compartmentalization [78]. The application of blockchain in combination with big data is visible in two segments: data management and data analytics. The various blockchain-based big data applications are summarized in Table II. In data management, blockchain technologies, being secure and distributed, are implemented to store important data. They can also evaluate data authenticity and stop tampering with sensitive data. In data analytics applications, blockchain is used to analyze trading trends and to predict potential customers, diseases, or business partners [79].

    A. Blockchain Big Data in Smart City

    Rapid urbanization has led to the development of smart cities, which require efficient and intelligent solutions for transportation, administration, environment, and energy optimization. The integration of IoT, big data, and energy-efficient Internet technologies has the capability to provide the infrastructural solutions required for smart city life. However, there are numerous problems related to inferior security, reliability, maintenance, adaptability, and costs. Blockchain technology caters to such needs by offering transparency, energy efficiency, space, recoverability, and maintenance of the IoT devices. The study in [19] discusses the use of hashing, asymmetric encryption, a consensus algorithm, a blockchain structure, and a Merkle tree in ensuring tamper-free transactions. In this framework, blocks are interlocked with one another, and the transactions within each block are bound together with the help of a Merkle tree, which makes it even more secure for performing seamless transactions. Recent years have also witnessed a surge in the development of big data based auditing systems termed third party auditors (TPAs). TPAs are centralized frameworks that are subject to security issues within the cloud environment. Blockchain technologies have been used to create decentralized TPAs for smart cities with enhanced security and reliability. This framework is named Data Auditing Blockchain


    TABLE I
    SERVICES PROVIDED BY CLOUD ENVIRONMENT FOR BIG DATA, CHALLENGES AND BLOCKCHAIN BASED SOLUTIONS.

    Ref | Cloud based services | Challenges faced by big data | Solutions provided by blockchain
    [65] | Data collection. | Data collection is exposed to various malicious attacks and threats. | Blockchain provides energy efficient data collection and a secure data sharing environment using Ethereum.
    [66] | Data transmission/sharing. | Lack of authorization for data sharing in edge nodes, and long response time. | A blockchain based futile transaction filter algorithm accesses data from the cache layer instead of the storage layer, which helps to reduce response time and storage overhead. Smart contracts are used for authorization.
    [68] | File storage system. | Unauthorized access to the electronic file system. Privacy, security and redundancy problems. | Blockchain integrated with IPFS provides a decentralized platform that solves file redundancy problems and secures the file storage system. The hash value of the data is stored in the blockchain to provide authenticity to the users, and an attribute based encryption method is applied before data storage in the cloud.
    [69] | Database management system. | Data stored in a distributed database is exposed to internal and external attacks. | Blockchain overcomes data tampering using time stamping. A virtual shared ledger stores the transaction history. Database transactions are recorded in blocks, and the blocks are linked to each other using cryptographic hash values. The blockchain based solution integrates storage servers and cryptographic algorithms for reliable database access.
    [71] | Data training/learning process. | Various entities share data to integrate a dataset with various attributes and train the ML classifier. Data privacy issues occur while sharing data from various entities. | A consortium blockchain and a homomorphic cryptosystem provide a secure training platform without the intervention of a trusted third party. Blockchain provides a secure environment for communication between the entities.
    [76] | Data privacy preservation. | User privacy is an issue in digital scenarios in the big data era. Services provided by third parties are exposed to security breaches and data misuse. | Blockchain provides an immutable, verifiable and decentralized ledger to record the transactions in digital scenarios. It enables users to control their personal data. Crypto-privacy methods are applied to solve privacy preserving problems.

    (DAB); it traces the entire audit history and allows owners to audit their files at any point in time. It also includes batch verification of multiple auditing proofs, ensuring security and privacy protection [80]. In [81], a blockchain-based infrastructure is presented that provides secure spatio-temporal smart contract services. The framework supports a sustainable IoT-based sharing economy in smart mega cities. The huge generation of big data has created the need to collect, analyze, and utilize it for autonomously predicting risky or exceptional events. The framework consists of device-to-device (D2D) communication systems and fog nodes installed onsite to enable the blockchain and other offline operations. A three-tier architecture is used for supporting sharing economy services in the blockchain-based smart city environment. The client tier includes the smart applications, IoT, and associated infrastructure. The client tier communicates with the mobile edge computing (MEC) tower through WiFi, ZigBee, 5G, and other related technologies. The MEC tower hosts the blockchain nodes, data storage client, related databases, and cloudlet applications, and thereby manages the load efficiently. The data from the blockchain, IoT, and social networks are finally fed into the AI engine for sophisticated analysis such as digital forensics and emotion extraction.
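
    The Merkle tree mentioned in connection with [19] can be illustrated with a short Python sketch that computes a Merkle root over a list of transactions. It is a generic illustration rather than the construction used in [19]: the function name is hypothetical, single SHA-256 is used for hashing, and an odd-sized level is padded by duplicating its last hash, which is one common convention.

```python
import hashlib


def merkle_root(transactions) -> str:
    """Return the hex Merkle root of a non-empty list of byte strings."""
    if not transactions:
        raise ValueError("at least one transaction is required")
    # Leaf level: hash every transaction.
    level = [hashlib.sha256(tx).digest() for tx in transactions]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # pad an odd level by repeating the last hash
        # Parent level: hash each adjacent pair of child hashes.
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0].hex()


txs = [b"alice->bob:5", b"bob->carol:2", b"carol->dave:1"]
root = merkle_root(txs)

# Changing any single transaction changes the root stored in the block header.
tampered_root = merkle_root([b"alice->bob:50", b"bob->carol:2", b"carol->dave:1"])
```

    Because the root depends on every transaction, storing only the root in a block header is enough to detect a change to any transaction in that block.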

    B. Blockchain Big Data in Smart Healthcare

    Recent advances in the healthcare sector have led to a drastic rise in medical data generation. These data are extremely important for diagnosis, prediction, and treatment purposes. Healthcare professionals have recently started focusing on the use of IoT and related wearable technologies, wherein sensors, devices, and vehicles are connected through the Internet to provide services for the benefit of mankind. As an example, the remote patient monitoring system is commonly used for treating elderly patients in particular. Although these technologies have enormous benefits, they suffer from the aforementioned security issues while transferring and logging data transaction information. These issues can lead to severe violations of data security and privacy. The use of blockchain is a potential solution that would provide security and efficiency in analyzing data, but it is costly and lacks energy optimization. The study in [82] proposes a framework that resolves such issues using public keys, private keys, and lightweight cryptographic techniques in integration with blockchain technology. The framework thus provides patients with access control over medical records, with improved privacy and security. In [83], a secure smart healthcare system is proposed using blockchain. Private data, public data, and related sensitive information are captured using sensors and then encrypted using blockchain technologies. This information is stored in a distributed format rather than in centralized cloud storage systems and can be accessed only by authorized individuals who have approval from the patients. Similarly, healthcare professionals seeking to access patient records need to send a request to the patient, and once the real-time notification is processed, the information is available to them. All the entities in this framework, such as IoT devices, electronic health records (EHRs), the encryption/decryption system, and blockchain mechanisms, remain connected through wireless sensor networks (WSNs) to conduct seamless yet secure communication. In [84], a private blockchain framework is proposed using the Ethereum protocol, wherein the sensors communicate with the smart devices. These smart devices call smart contracts, which keep records of all events on the blockchain. Thus, these smart contract systems help in monitoring patients in real time and also send notifications to healthcare professionals when medical interventions are required. The saved records are secured by the linkage in the blockchain, which provides authentication and eliminates the possibility of tampering with EHRs.
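
    The monitoring flow described for [84], in which devices call a contract that records every event and notifies professionals when intervention is needed, can be mimicked in plain Python as follows. This is a simulation of the logic only, not Ethereum contract code and not the implementation of [84]: the class name, the use of heart rate as the monitored signal, and the threshold values are all hypothetical and carry no clinical meaning.

```python
class MonitoringContract:
    """Plain-Python stand-in for a smart contract that logs readings and raises alerts."""

    def __init__(self, low: int, high: int):
        self.low = low      # hypothetical lower bound of the accepted range
        self.high = high    # hypothetical upper bound of the accepted range
        self.events = []    # append-only log standing in for on-chain event records
        self.alerts = []    # notifications intended for healthcare professionals

    def report(self, patient_id: str, heart_rate: int):
        """Record a reading; return an alert if it falls outside the accepted range."""
        event = {"patient": patient_id, "heart_rate": heart_rate}
        self.events.append(event)
        if not (self.low <= heart_rate <= self.high):
            alert = {"patient": patient_id,
                     "reason": "heart rate out of range",
                     "value": heart_rate}
            self.alerts.append(alert)
            return alert
        return None


contract = MonitoringContract(low=50, high=120)
contract.report("patient-1", 72)           # within range: logged, no alert
alert = contract.report("patient-1", 140)  # out of range: logged and alerted
```

    Every reading is logged regardless of whether it triggers an alert, which mirrors the idea that the contract keeps a record of all events while notifying professionals only when intervention may be required.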

    C. Blockchain Big Data in Smart Transportation

    Transportation helps to move human beings and goods from one location to another. Although blockchain has immense potential to benefit the transportation sector, individuals in this sector are not well informed about this emerging technology. Various other technologies, namely Mobility as a Service (MaaS), IoT, AI, and DL, have converged with blockchain technologies to revamp the traditional approach to transportation. The automotive sector has also used blockchain technologies for developing intelligent transportation systems and for offering services like remote software-based vehicle operation, automated insurance services, smart charging, and cab sharing services [85].

    Blockchain technologies have seen a rapid growth due toits potential to revolutionize intelligent transportation systems(ITS). Such developments can be used to create secured,reliable and autonomous ITS ecosystems with optimized usageof relevant infrastructure and resources. As an example, thestudy by [86] presented a seven layer conceptual model forITS that would help in characterizing the architecture andmajor components in a blockchain based system. The physicallayer holds the different vehicles, devices and assets relevant toITS. The main aspect of this layer includes the use of IoT forproviding enhanced security and privacy for the blockchainbased transportation systems. The data layer provides thedata blocks and associated encryption algorithms, hashingalgorithms and Merkle trees. The network layer defines theprocess involved in distributed networking, data forwardingand authentication. The packaging of the consensus algorithmsis done by the consensus layer followed by the incentive layerwhich specifies the mechanisms for issuance and allocationof coins to nodes in the blockchain network. The contractlayer constitutes algorithms and smart contracts that activatesthe process of data storage in the blockchain. Finally, theapplication layer encompasses the scenarios and use cases ofblockchain based ITS. Security is often a major concern invehicle communication systems. The study in [87] presented asecure key management framework for accomplishing networksecurity. The study utilizes the role of security managers whocapture vehicle departure data and encapsulate the blocks totransport keys and later implement rekeying to the vehicleswithin the secured domain. The framework proposes an ef-ficient key management system for key transfers among thesecurity managers in a heterogeneous vehicle communica-tion network architecture. 
In addition to security provision, blockchain technologies play a significant role in privacy protection for ITS, especially in car navigation. The framework proposed in [88] is based on an offline blockchain storage system in which all sensitive data extracted from users are stored and later shared using encryption keys specific to a particular car cluster. The system uses two major applications: a client application installed on the user's smartphone and a main application installed on the server side. It is assumed that the smartphone and the server are configured securely, and that simple or complex security policies are applied to data sharing depending on the user type. All clients in the network are grouped into clusters by location to optimize the use of computational resources and to reduce network delays and overheads. The system allows users to define privacy policies, which are then enforced automatically by the client application that provides accurate transportation routes. Blockchain technologies can also potentially solve various problems relevant to car insurance: with the help of this technology, insurers can track claims seamlessly by searching the trusted ledger. The study in [89] presented a prototype framework for fine-grained transportation insurance services in which the premium is calculated from vehicle usage and driver behavior. This information is collected from streaming IoT data gathered by mobile sensors. The framework makes insurance transparent and motivates drivers to drive safely in order to earn insurance incentives. Mobile GPS sensors are placed in vehicles for continuous monitoring of their GPS location. The GPS trajectory data are uploaded to a public cloud or data center using the IoT suite and later saved in a GIS database. The IoT messages trigger a spatio-temporal data analytics function that extracts driver behavior and vehicle usage data. These data are saved in the distributed ledger on the blockchain, ensuring transparency, traceability, and safety. In the Ethereum-based framework, the premium is evaluated from driver behavior and vehicle usage, which are tokenized according to varying risk levels through a fine-grained process.
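The idea of pricing a premium from usage and behavior data can be illustrated with a minimal sketch. It is not the scheme of [89]: the risk thresholds, the per-kilometre rates, and the names `risk_level` and `premium` are hypothetical values chosen only to show how usage (distance driven) and behavior (harsh driving events) could be mapped to discrete risk levels before a premium is computed.

```python
# Illustrative only: thresholds and rates below are assumptions, not values from [89].
RATE_PER_KM = {"low": 0.02, "medium": 0.03, "high": 0.05}


def risk_level(distance_km, harsh_events):
    """Map usage and behavior data to a coarse risk level."""
    if distance_km <= 0:
        return "low"  # no driving recorded, so no observed risk
    events_per_100km = 100.0 * harsh_events / distance_km
    if events_per_100km < 1.0:
        return "low"
    if events_per_100km < 3.0:
        return "medium"
    return "high"


def premium(distance_km, harsh_events):
    """Usage-based premium: distance driven times the rate of the risk level."""
    level = risk_level(distance_km, harsh_events)
    return round(max(distance_km, 0) * RATE_PER_KM[level], 2)


if __name__ == "__main__":
    print(risk_level(500, 2), premium(500, 2))
    print(risk_level(500, 20), premium(500, 20))
```

In a ledger-backed deployment, the inputs to such a function would be the behavior and usage records stored on the blockchain, so that both insurer and driver can audit how a premium was derived.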

    D. Blockchain Big Data in Smart Grid

Blockchain technologies can contribute significantly to improving the efficiency of practices and processes in the energy sector. Blockchain integrated with big data can accelerate the development of IoT platforms and digital applications, thereby enabling innovation in P2P energy trading and decentralized services. Present energy systems are undergoing radical transformation due to the advancement of distributed energy resources and the use of information and communication technologies. Blockchain architectures have the capability to solve issues related to controlling and managing decentralized energy systems and micro-grids [90]. The smart grid is a technology that makes electrical power grids more efficient, more robust, and less polluting. The advanced metering infrastructure (AMI) is one of the major components of the smart grid architecture; it ensures two-way communication between users and the utility by installing a smart meter at the user end. Key management plays a major role in this process, and most traditional architectures depend on a single entity for key distribution and maintenance. The study in [91] proposes a distributed key management system to maintain security in the smart grid. A key agreement protocol is proposed between the utility and the smart meter, followed by a distributed multi-case key management scheme that allows group members to manage their group communication effectively. The blockchain architecture enables distributed entities to interact with each other in a distributed P2P network while ensuring security, scalability, and efficiency. In [92], a data integration and regulation system based on a consortium blockchain is proposed. A signcryption algorithm is applied to multidimensional data acquisition and to the receivers in the blockchain framework. As part of the regulation process, the control center, the grid operator, and the grid supplier receive fixed blocks from the blockchain and then obtain the plaintext through decryption. The multidimensional data are first analyzed by the relevant receivers, which results in the creation of control pieces. These control pieces take care of security and data integrity, thereby reducing communication costs. In [93], a blockchain-based demand response management system termed GUARDIAN is proposed, which is capable of making trading decisions in the energy sector. The system is highly secure and contributes significantly to load management in the residential, industrial, and commercial sectors. The miner nodes in this framework, termed block verifiers, are selected according to their power consumption and processing capability. These nodes help authenticate energy transactions in the smart grid network. A transaction is initiated by the end user, who creates a block of transactions for energy trading. The miner nodes validate these blocks and add them to the blockchain, after which they become eligible to take part in energy trading. This helps achieve security and eliminates unauthorized entries in energy trading.

    E. Blockchain Big Data Projects

Blockchain is the technology that underpins cryptocurrencies such as Bitcoin and Ethereum. Big data, on the other hand, is a data science concept concerned with large datasets of great variety, size, and velocity. These datasets are analyzed to reveal interesting patterns, associations, and trends. Blockchain is a type of distributed ledger that records transactions in a way that cannot be altered. The trust associated with blockchain eliminates the need for third parties to regulate transactions and ensures that the data are immutable. Blockchains have many applications in data science, where data integrity must be maintained while performing data analysis and data sharing [94]. The benefits of blockchain are applied in three areas, namely decentralized data storage, blockchain-enabled data analysis, and blockchain-enabled data security, as shown in Fig. 4. Some of the blockchain projects for big data applications are discussed below.

1) Storj: Storj is an end-to-end decentralized storage project that utilizes excess hardware and bandwidth capacity, enabling peer-to-peer authentication of storage contracts between providers and users [95]. Files are encrypted on the client side and then split into pieces termed "shards". These shards are stored three times at the farmer side to maintain backups. Only the client has access to the data, which provides more security than traditional centralized cloud services. The Storj cryptocurrency allows renters to check on the farmers' files and to pay for the maintenance of the storage system. Renters pay only for the space used, without any additional fees pertaining to user requirements or setup costs.
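The shard-and-replicate step described above can be sketched as follows. This is a simplified illustration, not Storj's actual protocol: client-side encryption is omitted (the input is assumed to be ciphertext already), and the round-robin placement, the farmer names, and the function names `split_into_shards` and `place_shards` are assumptions introduced here. A SHA-256 digest is recorded per shard so that a renter could later check a returned shard for integrity.

```python
import hashlib


def split_into_shards(data, shard_size):
    """Cut an (already encrypted) byte string into fixed-size shards."""
    if shard_size <= 0:
        raise ValueError("shard_size must be positive")
    return [data[i:i + shard_size] for i in range(0, len(data), shard_size)]


def place_shards(shards, farmers, replicas=3):
    """Assign each shard to `replicas` distinct farmers (round-robin, illustrative)."""
    if replicas > len(farmers):
        raise ValueError("need at least as many farmers as replicas")
    placement = {}
    for index, shard in enumerate(shards):
        chosen = [farmers[(index + k) % len(farmers)] for k in range(replicas)]
        placement[index] = {
            "sha256": hashlib.sha256(shard).hexdigest(),  # integrity check value
            "farmers": chosen,
        }
    return placement


if __name__ == "__main__":
    ciphertext = b"abcdefghij"  # stands in for client-side encrypted data
    shards = split_into_shards(ciphertext, 4)
    for index, info in place_shards(shards, ["f1", "f2", "f3", "f4"]).items():
        print(index, info["farmers"], info["sha256"][:16])
```

Because every shard is held by three different farmers, the loss of any single farmer leaves at least two copies of each shard available.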

2) Omnilytics: Omnilytics is a blockchain platform for big data analytics that provides insights for the sales, marketing, and merchandising industries [96]. It uses blockchain, big data analytics, ML, AI, and various other technologies to integrate data from different industries. The platform provides data analytics and related services for competitor benchmarking, trend analysis, and pricing analysis. Blockchain is used to power smart contracts, distributed data fingerprinting, data exchange, and other services that track data trends and provide incentives through micropayments.

3) Rublix: The Rublix blockchain [97] uses decentralization to bring cryptocurrency traders together on a common trading platform where their credibility and predictions can be authenticated. The protocol combines the transparency and immutability of blockchain with investment data analytics to generate more accurate trading predictions. Traders are ranked by the accuracy of their predictions, and the blockchain verifies the traders and incentivizes them based on content quality.

4) Provenance: Provenance is a blockchain platform mainly used in supply chain management that gathers important product information and shares it in a trusted, secure, and accessible manner [98]. The blockchain architecture comprises six types of participants: producers, manufacturers, registrars, standards organizations, agents such as certifiers or auditors, and customers. The protocol gives consumers access to information on the origin of products, their journey along the supply chain, product quality, and environmental impact.

5) FileCoin: FileCoin intends to create a decentralized storage network that allows participants to buy and sell storage in an open market. Users rent storage on devices with excess capacity using the FileCoin cryptocurrency. Clients spend the cryptocurrency to share or retrieve data, and miners earn it by storing and serving data. When miners mine a block, they must submit a proof-of-space-time (PoST) to the network, which verifies that a storage provider is fulfilling its responsibility to store the outsourced data for the stipulated time frame. FileCoin consists of a blockchain, retrieval nodes, storage nodes, and a native FileCoin token. The storage nodes store sealed copies of data, and the transactions are recorded on the blockchain. The retrieval nodes fetch and deliver files to users, subject to the PoST [99].

Fig. 4. Popular blockchain big data projects. [Figure labels recovered from the original layout: Decentralized Cloud Storage Project (Storj: Shred, Encrypt, Spread, Audit); Blockchain Enabled Data Analysis Project (Omnilytics, Rublix); Blockchain Enabled Data Security Project (Provenance, Certifiers).]

6) Datum: Datum (DAT) is a decentralized, distributed, high-performance NoSQL platform backed by Ethereum, BigchainDB, and IPFS. It enables users to store data from social networks, IoT devices, and wearable technologies anonymously and securely. The platform also acts as a marketplace for sharing and selling data by providing each Datum user with a unique Datum ID, which is managed through the Datum mobile application available for Android and iOS [100]. A summary of big data blockchain applications is given in Table II.

    V. RESEARCH CHALLENGES AND FUTURE DIRECTIONS

Blockchain and big data are robust technologies that are evolving in almost all domains. Integrating them opens up new research opportunities, driven by the massive accumulation of data in today's data centers. Big data is increasingly used by business organizations to gain higher profits. Similarly, data held in blockchains are worth more because of their sensitive nature. Blockchain validates data and thus ensures quality in data management, whereas big data analytics makes better predictions from large quantities of data. Each technology has specific challenges that must be addressed, both when it is used individually and when the two are combined. The most prominent challenge is that massive data silos in the big data environment must be secured while ensuring integrity and non-repudiation in data transactions. Blockchain, with its decentralized framework and secure, immutable nature, is therefore a natural choice. However, blockchain itself poses challenges that must be addressed on deployment. The goal of integration is to store massive data on decentralized ledgers with authorized data access instead of on centralized servers [101], and to allow users to share their unused storage in exchange for cryptocurrencies such as bitcoin [102]. This section presents the key challenges and future directions related to blockchain big data research.

The research challenges arising from the integration of blockchain and big data are summarized in Table III.

    A. Research Challenges

1) Security in Blockchain: The blockchain is a validated ledger that keeps track of digital transactions across diverse domains such as IoT applications (which involve data transactions from heterogeneous devices), fifth-generation (5G) networks, healthcare, and financial services. Notable services that blockchain provides through decentralization are data security and privacy: a malicious user who tries to falsify block information must control more than half of the network's computational power [115]. This type of attack is called a 51% attack. Although a majority of the computational resources is required for a user to falsify information in the blocks, double-spending attacks are still possible. Blockchain smart contracts reinforce the environment against double-spend attacks [103]. The distributed nature of blockchain, which shares each transaction network-wide, makes fraudulent block transactions considerably more difficult. Although blockchain and big data are a good match, the data security issues associated with big data and with the analytics models that handle it must be considered. Because the blockchain stores the history of all transactions in the state in which they were performed, it is a strong candidate for big data applications. Data-intensive applications such as the healthcare industry, where big data is employed to manage voluminous data from medical practitioners, patients, clinicians, laboratories, and pharmaceutical companies, require privacy-preserving data sharing. Researchers suggest that a secured blockchain framework can be employed for controlled access to such voluminous data through its decentralized data management [12].
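Why a majority of the computational power matters can be made concrete with the standard gambler's-ruin model used in the Bitcoin white paper: if an attacker controls a fraction q of the hash power and the honest nodes control p = 1 - q, the probability that the attacker ever catches up from z blocks behind is 1 when q >= p and (q/p)^z otherwise. The sketch below evaluates only this simplified bound; the function name `catch_up_probability` is an assumption, and the model ignores network delay and other practical factors.

```python
def catch_up_probability(q, z):
    """Probability that an attacker with hash-power share q ever catches up
    from z blocks behind (simplified gambler's-ruin model)."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be a fraction between 0 and 1")
    if z < 0:
        raise ValueError("z must be non-negative")
    p = 1.0 - q  # share of hash power held by honest nodes
    if q >= p:
        return 1.0  # with half or more of the power, catching up is certain
    return (q / p) ** z


if __name__ == "__main__":
    for share in (0.10, 0.25, 0.45, 0.51):
        print(share, catch_up_probability(share, 6))
```

The probability falls off exponentially in z for a minority attacker, which is why waiting for several confirmations protects against double spending, while an attacker at or above 50% succeeds with certainty.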

TABLE II
SUMMARY OF BLOCKCHAIN BIG DATA APPLICATIONS.

| Ref. | Application | Description | Benefits | Research Challenges |
|---|---|---|---|---|
| [19] | Smart City | Use of hash, asymmetric encryption, consensus algorithm, blockchain, and Merkle tree. | Tamper-free transactions in IoT devices; development of a decentralized, secure, and auditable environment for IoT devices. | Maintaining a balance between privacy and accountability; implementing blockchain for crowd sensing. |
| [80] | Smart City | Secured big data auditing scheme using DAB. | Elimination of centralized third-party auditors; improvement in reliability and stability. | NA. |
| [81] | Smart City | Use of fog nodes and D2D to enable blockchain. | Prediction of risk or any exceptional event in smart contracts. | Large-scale testing in shared economy scenarios could be implemented. |
| [82] | Smart Healthcare | Access control of patient records using cryptographic techniques integrated with blockchain. | Provides security and privacy for IoT-based health monitoring systems. | Resource constraints of IoT act as a major challenge; commercialization in collaboration with industry partners. |
| [83] | Smart Healthcare | Sensitive information collected using sensors is encrypted and distributed using blockchain; implemented in IoT-based EHR systems connected through WSN. | Ensures identity verification and fraud detection. | Implementation on large-scale healthcare data. |
| [84] | Smart Healthcare | Implements an Ethereum smart contract in which sensors communicate with the smart devices. | Monitors patients in real time; sends alerts for medical interventions. | Large-scale adaptation; resource utilization. |
| [85] | Smart Transportation | Implementation of smart contracts, track & trace, fast payment, and supply chain finance using blockchain. | Data authentication and decentralization to provide knowledge in shipping, logistics, and transportation. | Successful applications in transportation engineering are still not predominant. |
| [87] | Smart Transportation | Implements a key management system for key transport in VC systems. | Captures vehicle departure data; ensures secured key transport and re-keying of vehicles. | Maintaining a balance between security and privacy; pseudonym management can be included. |
| [88] | Smart Transportation | Enables users to travel between locations and to make insurance and finance decisions on blockchain-based disruptive technology. | Defines privacy policies and resolves issues in transportation route finding, car insurance, and tracking of claims. | Use of GPS positioning of the users not included. |
| [89] | Smart Transportation | Implements a fine-grained transportation insurance service based on vehicle usage and driver behavior using Hyperledger and cryptocurrency. | Promotes safe driving and unbiased insurance claims. | Large-scale implementation in cities and other similar applications. |
| [90] | Smart Grid | Implementation of blockchain and distributed consensus algorithms in the energy industry. | Innovations in P2P energy trading and decentralized energy generation. | Achieving market penetration and commercial viability. |
| [91] | Smart Grid | Distributed key management for AMI in smart grids. | Use of a multi-case key management scheme for security of smart grids. | Scalability to process transactions; difficulty in predicting price due to volatility in supply and demand. |
| [92] | Smart Grid | Implementation of a signcryption algorithm for a data integration and regulation system in a blockchain framework. | Security of multidimensional data; reduction of communication costs. | Real-time analysis could be included. |
| [93] | Smart Grid | Implements GUARDIAN, blockchain-secured demand response management. | Enables accelerated decision making for energy trading; load management in the residential, industrial, and commercial sectors. | Testing and deployment of the scheme on larger datasets; optimization of the scheme to reduce latency and increase network throughput. |

Moreover, permissionless blockchains, which allow any user to join the chain without permission, are secured by Hyperledger [116]. Hyperledger strengthens the permissionless blockchain by allowing the users involved in a transaction to join through a permissioned blockchain, guaranteeing data provenance. One possible network threat to this kind of public blockchain is the Sybil attack, in which a node adds a large number of malicious identities under its control [117]. PoW consensus algorithms mitigate such attacks by forcing the malicious node to devote more of its computational resources to carry out the attack. An analysis in [33] suggests that resource-constrained IoT environments, in which large amounts of data are accumulated from varied sources, cannot be secured with traditional cryptographic infrastructure. The authors also argue that the secured, distributed, and anonymized nature of blockchain makes it a strong contender for such environments. Furthermore, a lightweight blockchain should be preferred for optimal utilization of computational resources in resource-constrained IoT environments with big data services.

Distributed ledger technology ensures trusted data transactions with immutability and transparency via peer-to-peer networking services. Blockchain promises better scalability than centralized architectures. However, as the chain grows longer, the number of entries increases and the computational load of processing the data rises tremendously. In blockchain applications such as IoT, the nodes are simple and resource-constrained. Yet the cryptographic functions that provide security in blockchain consume considerable computational resources for key exchange, encryption, decryption, and digital signatures. The miners (the nodes that perform mining), which are responsible for creating new blocks and linking them to the existing chain, bear a higher computational load [118].
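How blocks are linked, and why the verification workload grows with the chain, can be seen in a minimal hash-chain sketch. The block layout (index, previous hash, payload), the JSON serialization, and the function names are assumptions made for illustration; real blockchains add consensus, signatures, and Merkle trees on top of this basic linkage.

```python
import hashlib
import json

GENESIS_PREV = "0" * 64  # placeholder previous-hash for the first block


def block_hash(index, prev_hash, payload):
    """Hash of the block contents, including the hash of its predecessor."""
    body = json.dumps(
        {"index": index, "prev_hash": prev_hash, "payload": payload},
        sort_keys=True,
    )
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def build_chain(payloads):
    """Create a list of blocks in which each block commits to the previous one."""
    chain, prev = [], GENESIS_PREV
    for index, payload in enumerate(payloads):
        digest = block_hash(index, prev, payload)
        chain.append({"index": index, "prev_hash": prev,
                      "payload": payload, "hash": digest})
        prev = digest
    return chain


def is_valid(chain):
    """Re-hash every block; the cost is linear in the length of the chain."""
    prev = GENESIS_PREV
    for index, block in enumerate(chain):
        if block["index"] != index or block["prev_hash"] != prev:
            return False
        expected = block_hash(block["index"], block["prev_hash"], block["payload"])
        if expected != block["hash"]:
            return False
        prev = block["hash"]
    return True


if __name__ == "__main__":
    chain = build_chain(["meter reading 1", "meter reading 2", "meter reading 3"])
    print(is_valid(chain))
    chain[1]["payload"] = "tampered"
    print(is_valid(chain))
```

A full validation pass re-hashes every block, so a resource-constrained node pays more as the chain lengthens, which is the load the paragraph above refers to.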

Therefore, the cryptographic techniques or security measures used to enhance security in the blockchain environment should not demand excessive computational resources. The use of blockchain is increasing every day, along with the complexity of the data stored in the blockchain. Hence, blockchain data analytics must be explored to ensure good blockchain performance across data of varying complexity [104]. Also, before integrating blockchain with other technologies such as big data, the type of blockchain (public or private), the security measures adopted, and the data processing capabilities should be considered for network safety and better performance.

TABLE III
RESEARCH CHALLENGES IN INTEGRATION OF BLOCKCHAIN AND BIG DATA.

| Ref. | Challenges | Application | Description | Benefits |
|---|---|---|---|---|
| [102] | Security and privacy enhancement in big data | Big data and cryptocurrency. | Integration of big data and cryptocurrency for decentralized data management. | Secured data sharing and decentralized data access. |
| [103] | Security and privacy enhancement in big data | Blockchain smart contracts for big data. | Vulnerability scan and programming correctness for security and correctness in smart contract operations. | Secured data sharing and privacy. |
| [33] | Security and privacy enhancement in big data | IoT big data, blockchain, and fog computing. | Big data security in fog-enabled IoT using blockchain. | Secured data transactions with low-latency response. |
| [104] | Security and privacy in big data exchange | E-crime detection and bitcoin price predictions. | Interpretation of data stored in public blockchain. | Secured blockchain data transactions. |
| [105] | Security and privacy in big data exchange | Smart toy assisted with MEC and blockchain. | Smart contracts are used for validating various data exchanges authorized using blockchain. | Secured and low-latency response in smart toy business. |
| [101] | Security and privacy in big data exchange | Big data exchange and smart contract. | A fair way of protecting user data copyright that ensures privacy using SC. | Privacy in decentralized big data sharing. |
| [106] | Blockchain standardization | Blockchain and DLT. | High-level functional architecture for blockchain and DLT. | Standards for various functionalities in blockchain. |
| [107] | Blockchain standardization | Early standardization for blockchain immutability. | Describes different levels of standardization and their importance. | Participatory standard for blockchain immutability. |
| [108] | Complexity in big data | Crime big data. | Mining of data using various state-of-the-art data mining techniques. | Efficient mining of data for crime departments. |
| [109] | Complexity in big data | Big data in mobile network optimization. | Explores the features of big data from the perspective of users and network operators. | Effective mobile data management. |
| [110] | Computational overhead in blockchain | Cyber-physical social systems. | A lightweight blockchain for big data. | Privacy-preserving data transactions with low latency. |
| [111] | Computational overhead in blockchain | Blockchain in distributed databases. | Blockchain on distributed databases allows 1 million writes per second. | Scalability and faster querying with sub-second latency. |
| [112] | Computational overhead in blockchain | Blockchain in cloud-based healthcare big data. | Suggests off-chain computation of healthcare data with control in the blockchain. | Secured and immutable medical transactions. |
| [113] | Computational overhead in blockchain | Blockchain for supply chains with 5G MEC. | One-way hash and bitwise rotation make the system light. | Low-latency response. |
| [114] | Network virtualization | SDN, big data, blockchain, and 5G MEC. | SDN and big data are integrated with faster 5G and immutable blockchain. | Faster query processing and secured data transactions. |

2) Standardization: Blockchain was initially developed as a solution to the problem of digital cash (the cryptocurrency named bitcoin). It facilitates the secure transaction of digital assets across different banks. Blockchain automated global payments over the Internet, completing them within hours irrespective of geographical constraints, whereas the traditional financial system takes many days to perform financial transactions worldwide. Nevertheless, the adoption of blockchain has been hindered by interoperability challenges. These challenges include not only the differences among cryptocurrencies but also the differences among the diverse types of transactions. It is therefore difficult for blockchains to interoperate and integrate with legacy systems, which, in turn, may hinder the regulatory acceptance of blockchains. One possible solution for this type of open system is standardization, which provides common technical guidelines for any industry.

An analysis of blockchain terminologies and of the various standardization initiatives taken by non-profit organizations was carried out in [106]. To standardize distributed ledger technology (DLT) and blockchain, the International Organization for Standardization (ISO), which develops and publishes standards, has formed the technical committee ISO/TC 307, led by Standards Australia. The primary goal of this committee is to publish standards related to blockchain taxonomy, smart contracts, security (for users and data), privacy, interoperability, governance, and various use cases of blockchain. The working groups under ISO/TC 307 and their activities are summarized in [119]. The International Telecommunication Union (ITU) focuses on identifying and standardizing DLT applications and services, the best practices to be adopted for their implementation, and further research on related standards. The World Wide Web Consortium (W3C), which develops web standards, has initiated standards for blockchain message formats (ISO 20022), guidelines on blockchain storage (public, private, and side-chain), and the approval of use cases. IEEE has developed a standard framework for blockchain use in IoT and a handbook on blockchain asset exchange. The Internet Engineering Task Force (IETF), an open group that develops interoperability standards for network communications, also has a considerable impact on blockchain standardization.

Furthermore, a framework for implementing blockchain immutability was designed in [107], whose authors also discuss the effects of early standardization on blockchain immutability. They distinguish three types of standards, namely anticipatory, participatory, and responsive standards. Anticipatory standards are developed before a new service or technology is accepted. Participatory standards are developed and adopted during the implementation of the technology to test the conformance specification. Responsive standards are adopted after the technology has been adopted (or during its evolution). A framework of participatory standards for deploying the immutability concept of blockchain and its operation is discussed.

Blockchain for big data allows data sharing in cross-domain environments, irrespective of the risk factors associated with accumulating data from various data silos. Therefore, when adopting blockchain, proper standards and guidelines should be adhered to for the smooth functioning of the technology.

3) Complexity of Big Data: The emergence of cloud computing and smart IoT applications has led to a massive accumulation of data. Along with the enormous growth of data in this information age, data management issues such as inaccessible data, dirty or unclean data, and data privacy have also increased [120]. With the advent of big data, data quality management has become more challenging. Furthermore, when handling larger and more complex datasets, companies should verify the authenticity of the data source, ensure data cleanliness, and guard against data breaches. Because of this, the complete digital transformation of entire legacy datasets is still a challenging issue. Blockchain can address the security aspect of data management, yet the complexity of big data management must still be considered on integration.

The prominent challenges of big data stem from the nature of the data, conventional analysis models, and inefficient data processing systems. Big data is inherently complex, which makes it challenging to represent and interpret and thereby increases computational complexity. Big data comes from heterogeneous sources that exhibit different patterns and behaviors. Some of its essential characteristics are complex data types and structures, intricate relationships, and widely varying quality. Big data mining activities such as data retrieval, topic analysis, and text mining (sentiment and semantics extraction) are more challenging than for traditional data [108]. A lack of knowledge of these characteristics and of domain-specific data processing techniques results in inefficient computational models. A clear understanding of the attributes of inherently complex big data is mandatory for designing computational models with the highest level of abstraction. Apart from diversified sources and massive volume, a critical feature of big data is that it changes dynamically (real-time information) [121].

Big data processing systems are themselves complex, because they must handle the inherent complexity of big data. These systems are built with high processing capability and therefore require more computational resources. System complexity includes elaborate architectures, different processing modes, and computing requirements. A basic understanding of this system complexity directly affects the performance of big data systems. Also, the parameters affecting the energy utilization of big data processing systems must be considered when designing a robust framework. Some of them are system throughput, energy consumption, resource utilization, distributed data storage, parallel computation, and accuracy in job calculation.

Furthermore, big data offers mobile networks more opportunities to improve their service quality. The study in [109] explored the integration of big data with mobile network optimization, focusing on the characteristics of big data from the perspectives of mobile network operators and users. The user-specific data obtained from user equipment include profile data (location, communication pattern), behavior data, and other application data. The data of network operators include data from the core network, the radio access network, and Internet service providers. The core network provides data related to network performance, call details, and the application usage index. Information sourced from the radio access network includes cell configuration, mobility, handover details, resource utilization (source details and link utilization), interference details, signal measurements, and notification signaling messages exchanged among components of the mobile network. The effectiveness of the service offered by a mobile network depends on how effectively network operators process this information and make valuable decisions. Efficient data analytics mechanisms are essential for better network optimization.

Therefore, the complex nature of big data must be taken into account when integrating it with blockchain, as doing so will improve how data are handled in big data processing models. Also, the trade-offs between complexity and computation and between energy consumption and efficiency should be evaluated in order to lay out effective means of data sharing, trusted transactions, data access, intruder detection, and enhanced security through decentralized blockchains.

    B. Future Directions

Big data, which is characterized by variability, volume, veracity, value, and complexity, requires data processing systems with high computational capability. Likewise, the decentralized distributed ledger of blockchain offers immutable, secured, and transparent data transactions that require more computational power for effective services. Integrating such complex big data with blockchain incurs additional computational complexity, which can lead to poor system performance. Therefore, adaptive blockchain designs that reduce the computational resources consumed by blockchain should be preferred, and 5G network communication can be utilized for faster services. This section presents future directions for the integration of blockchain with big data.

1) Adaptive Blockchain Design for Big Data: An adaptive blockchain reduces the computational power required for processing blocks even if the chain grows exponentially. The most preferred adaptive blockchain designs are lightweight blockchains for real-time big data and scalable blockchains for large-scale big data. The framework in [110], designed for large-scale and real-time big data applications in cyber-physical social systems (which integrate cyber, physical, and social systems), uses blockchain for access control. The framework uses fog computing at the edge nodes to process local data dynamically. A lightweight symmetric algorithm is used for encryption to enable privacy-preserving data transactions. The cyber-physical social system big data is accessed using the account address of bl