
Processing and Securing Healthcare Datasets Through Hadoop And Implementing Cryptography Technique

1 Dr. E. Laxmi Lydia

Associate Professor & Big Data Consultant, Computer Science and Engineering

Vignan's Institute of Information Technology, India [email protected]

2 P. Harika, 3 S. Poorna Chandrika, 4 N.G.N.L. Sai Tejaswi, 5 P. Karthik

UG Students, Computer Science and Engineering, Vignan's Institute of Information Technology, Duvvada, Visakhapatnam, Andhra Pradesh, India

Abstract— This paper deals with identifying healthcare data and transmitting it in a secure manner using cryptography and Hadoop techniques. In today's healthcare systems, large amounts of data, i.e. big data, arrive from multiple sources with high velocity, volume and variety. Healthcare functions such as patient registries and discharge and admission data are stored in Computer-Based Patient Records (CPR) and Electronic Health Records (EHR). This healthcare information must be sent from sender to receiver without interception by third parties (i.e. attackers). In this project we implement a solution based on cryptographic techniques, encrypting and decrypting the information to provide security, and we propose the Advanced Encryption Standard (AES): the data is first encrypted at the sender's side and then decrypted at the receiver's side, so that the transmitted data is secure. In addition, Hadoop technology plays a key role in improving the quality of healthcare, since it helps deliver the correct information to the right people at the right time, thereby reducing cost and time. The MapReduce application makes use of jar files that contain a combination of MR code and Pig queries. By using these techniques and technologies we can obtain accurate results, which can later be used for further analysis in healthcare.

Keywords— MapReduce; Hadoop; Big Data; Cryptography; AES

I. INTRODUCTION

Big Data mainly deals with the techniques and technologies used for capturing, managing and processing large datasets, which are usually a petabyte or larger in size. The data can be structured, unstructured or semi-structured. It is gathered from various sources, such as healthcare, business, banks, institutions, traffic, sensors and cell phones, and arrives in the system at various rates. To process data generated in such large amounts, Hadoop technology is used, which is generally efficient and cost-effective. At present, managing this big data (i.e. large volumes of data) is one of the most challenging issues for a Hadoop cluster. Figure 1 demonstrates big data in various fields.

Fig. 1. Big Data in various fields

A. 3 V’s of Big Data

Volume of data: Volume refers to the amount of data being generated, day after day. The size of the data being collected and stored varies from megabytes and gigabytes to petabytes.

Velocity of data: Velocity is the speed at which data is generated. For example, on social media sites such as Facebook and Twitter, more than 800 million photos, 500 million videos and other data are uploaded every day.

Variety of data: The data stored and transmitted varies from structured to unstructured and semi-structured, including videos, audio, etc. The data differs from one application source to another.

B. Cryptography

Nowadays it has become commonplace to transmit multimedia data over the all-pervasive Internet. With the advent of electronic commerce, it has become essential to address the sensitive issue of data security, especially in the ever-expanding open network environment of the modern era. The encryption technologies of time-honored cryptography are employed extensively to protect data [16-18]. Cryptography can be defined as the process of encrypting and decrypting data; it enables the sender to store useful information and transmit it across the internet, which is an insecure network. As Figure 2 shows, data that can be read and understood without any special measures is called plain text. The plain text is encrypted so that it cannot be accessed by intruders; the result of this encryption is unreadable text called cipher text. The process of recovering the actual data (i.e. plain text) from the cipher text is called decryption. Encryption and decryption are carried out using a public key, a private key, or both. Modern cryptography has several objectives:

Authentication: ensuring that messages are delivered from an authentic source and have not been altered by an intruder.

Confidentiality: encrypting the data sent from sender to receiver using cryptographic techniques, so as to prevent attackers or intruders from reading it.

Fig. 2. Cryptography Encryption Process

C. Symmetric Encryption

Symmetric encryption uses the same key to convert the message (i.e. plain text) into cipher text; at the receiver's side, the message is decrypted using the same key that was used for encryption. Figure 3 demonstrates symmetric encryption.

Fig. 3. Symmetric Encryption Process

The sender uses the encryption technique to transmit a secured message, for instance when transmitting information about a patient from one physician to another; lab results can also be shared across the network in this way. Some common symmetric algorithms are:

AES (Advanced Encryption Standard)

Blowfish

IDEA (International Data Encryption Algorithm)

DES (Data Encryption Standard)

Triple DES

D. Asymmetric Encryption

In asymmetric encryption, transmitting patient records requires two keys: a public key and a private key. The public key is used to encrypt the information at the sender's side, and the private key is used to decrypt it at the receiver's side. Encryption techniques such as these protect our healthcare datasets from unauthorized access. Figure 4 demonstrates asymmetric encryption.

Fig. 4. Asymmetric Encryption Process

Some common asymmetric algorithms are:

RSA

Diffie-Hellman

Digital Signature Algorithm (DSA)

ElGamal
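As a small illustration (a sketch, not code from this paper), the following uses Java's built-in KeyPairGenerator and Cipher classes to encrypt with an RSA public key and decrypt with the matching private key; the sample payload is an invented placeholder.

    import java.nio.charset.StandardCharsets;
    import java.security.KeyPair;
    import java.security.KeyPairGenerator;
    import javax.crypto.Cipher;

    public class RsaDemo {
        public static void main(String[] args) throws Exception {
            // Receiver generates a key pair and publishes only the public key
            KeyPairGenerator kpg = KeyPairGenerator.getInstance("RSA");
            kpg.initialize(2048);
            KeyPair pair = kpg.generateKeyPair();

            // Sender encrypts the record with the receiver's PUBLIC key
            Cipher enc = Cipher.getInstance("RSA/ECB/OAEPWithSHA-256AndMGF1Padding");
            enc.init(Cipher.ENCRYPT_MODE, pair.getPublic());
            byte[] cipherText = enc.doFinal(
                    "patient-id:42".getBytes(StandardCharsets.UTF_8)); // hypothetical data

            // Receiver decrypts with the PRIVATE key, which never leaves the receiver
            Cipher dec = Cipher.getInstance("RSA/ECB/OAEPWithSHA-256AndMGF1Padding");
            dec.init(Cipher.DECRYPT_MODE, pair.getPrivate());
            System.out.println(new String(dec.doFinal(cipherText), StandardCharsets.UTF_8));
        }
    }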

II. ADVANCED ENCRYPTION STANDARD (AES)

AES is a symmetric block cipher that applies repeated rounds, where each round consists of four different stages. The block size is 128 bits, and the key size can be 128, 192 or 256 bits. The key length is assumed here to be 128 bits, because many AES parameters, such as the number of rounds, depend on the key length. The 128-bit block is divided into four groups of four bytes each: the first group of four bytes occupies the first column of the input matrix, the second group the second column, and so on [7]. Figure 5 demonstrates the AES mechanism.

Fig. 5. AES Working Mechanisms


The cipher includes N rounds, where N depends on the key length: per FIPS-197, N = Nk + 6, where Nk is the key length in 32-bit words, giving 10 rounds for 128-bit keys, 12 for 192-bit keys and 14 for 256-bit keys. The first N-1 rounds each consist of four transformation functions: SubBytes, ShiftRows, MixColumns and AddRoundKey. Each transformation takes one or more 4x4 matrices as input and produces a 4x4 matrix as output. The output of the final round is the cipher text. The key expansion function produces N+1 round keys, each a separate 4x4 matrix, and each round key serves as one of the inputs to the AddRoundKey transformation.
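Concretely, here is a minimal sketch of AES encryption and decryption with Java's standard javax.crypto API; the hard-coded 128-bit key and the sample record are illustrative assumptions, not the paper's actual code.

    import java.nio.charset.StandardCharsets;
    import java.util.Base64;
    import javax.crypto.Cipher;
    import javax.crypto.spec.SecretKeySpec;

    public class AesDemo {
        public static void main(String[] args) throws Exception {
            // 16 bytes = 128-bit key; a real system would generate and protect this key
            SecretKeySpec key = new SecretKeySpec(
                    "0123456789abcdef".getBytes(StandardCharsets.UTF_8), "AES");

            String plain = "John Doe,[email protected],123-45-6789"; // hypothetical record

            // Encrypt at the sender's side (ECB is shown only for brevity;
            // CBC or GCM with a random IV is preferable in practice)
            Cipher enc = Cipher.getInstance("AES/ECB/PKCS5Padding");
            enc.init(Cipher.ENCRYPT_MODE, key);
            String cipherText = Base64.getEncoder().encodeToString(
                    enc.doFinal(plain.getBytes(StandardCharsets.UTF_8)));

            // Decrypt at the receiver's side with the same key (symmetric)
            Cipher dec = Cipher.getInstance("AES/ECB/PKCS5Padding");
            dec.init(Cipher.DECRYPT_MODE, key);
            String recovered = new String(
                    dec.doFinal(Base64.getDecoder().decode(cipherText)),
                    StandardCharsets.UTF_8);

            System.out.println("Cipher text: " + cipherText);
            System.out.println("Recovered:   " + recovered);
        }
    }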

III. LITERATURE REVIEWS

The increasing digitization of healthcare information is being analyzed with new techniques to improve the quality of care and health outcomes while minimizing costs. Organizations must analyze internal and external patient information to measure risk and outcomes more accurately. At the same time, many organizations are working to increase data transparency in order to produce new insights.

Praveen Kumar et al. [1] show in their work that Hadoop, based on MapReduce, is a powerful tool for managing huge amounts of data, and that its ecosystem provides fault-tolerance techniques.

Emad A. Mohammed et al. [2] argue that big clinical data analytics should emphasize modelling of whole interacting processes in clinical settings, as clinical datasets evolve into ultra-large-scale datasets. Arantxa Duque Barrachina et al. [3] proposed a Hadoop-based methodology for categorizing large technical-support datasets.

Divya et al. [4] used a multi-authority attribute-based encryption scheme to protect shared data. Hongsong Chen [6] presents a novel Hadoop-based biosensor SunSPOT wireless network architecture combining an ECC digital signature algorithm, a MySQL database and Hadoop HDFS cloud storage, which a security administrator can use to protect and manage key data.

Lidong Wang et al. [15] base their work on SWOT (Strengths, Weaknesses, Opportunities, Threats) analysis and Radio Frequency Identification (RFID) technology.

A. Existing Work

In [14], the Base64 algorithm was used for the encryption step. Base64 encoding takes the original binary data and operates on it by dividing it into tokens of three bytes each, where each byte consists of eight bits. The algorithm is mainly used to avoid delimiter collisions. The MapReduce application uses jar files that contain a combination of MR code and Pig queries, together with UDFs (User Defined Functions), an advanced mechanism used here to protect the healthcare dataset; UDFs are commonly used to encapsulate data processing at a more granular level. The de-identified personal healthcare system executes MapReduce jobs and Pig queries on the healthcare dataset [14].
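To make the distinction concrete, here is a minimal sketch (with an invented sample value) showing that Base64 is reversible by anyone, with no key involved:

    import java.nio.charset.StandardCharsets;
    import java.util.Base64;

    public class Base64Demo {
        public static void main(String[] args) {
            String ssn = "123-45-6789"; // hypothetical sensitive field

            // Base64 is only an encoding: every 3 input bytes become 4 output chars
            String encoded = Base64.getEncoder()
                    .encodeToString(ssn.getBytes(StandardCharsets.UTF_8));

            // Anyone can reverse it without any secret, so it provides no security
            String decoded = new String(Base64.getDecoder().decode(encoded),
                    StandardCharsets.UTF_8);

            System.out.println(encoded);  // MTIzLTQ1LTY3ODk= (about 33% larger)
            System.out.println(decoded);  // 123-45-6789
        }
    }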

B. Gap Between Existing And Proposed Work

The Base64 algorithm is only practical for small payloads such as tiny images, and Base64-encoded data is about 33% larger than the raw data, so more data has to be transferred over the internet. More importantly, Base64 is not a security mechanism: anyone can convert Base64-encoded data back into its original bytes, so it should not be used as a means of protecting data, but only as a format for displaying or storing raw bytes more conveniently. In addition, bits may be lost when the raw data is not an exact multiple of eight bits.

IV. PROPOSED METHODOLOGY

To overcome the problems encountered with Base64, we propose the Advanced Encryption Standard (AES). As the power of computers increases, stronger algorithms are required to withstand attacks, and AES is the response to this requirement. It is a widely used standard with three common key lengths: 128, 192 and 256 bits. AES has been implemented in both software and hardware, and it works quickly and efficiently even on small devices such as smartphones. It uses a 128-bit block together with 128-, 192- or 256-bit keys. AES provides strong security in the long term and is considered unbreakable in practical use.
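For instance, all three key lengths can be produced directly with Java's KeyGenerator (a sketch, not the paper's code; note that very old JREs, before 8u161, required the JCE unlimited-strength policy files for 192- and 256-bit keys):

    import javax.crypto.KeyGenerator;
    import javax.crypto.SecretKey;

    public class AesKeyLengths {
        public static void main(String[] args) throws Exception {
            for (int bits : new int[] {128, 192, 256}) {
                KeyGenerator kg = KeyGenerator.getInstance("AES");
                kg.init(bits);                 // choose the key length
                SecretKey key = kg.generateKey();
                System.out.println(bits + "-bit key: "
                        + key.getEncoded().length + " bytes");
            }
        }
    }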

A. Research Methodology

For de-identifying the dataset, the following platforms and tools are used for big data analytics in healthcare. This work follows this procedure:

1. Data collection

2. Hadoop Cluster and Map Reduce

3. Experiments

B. Data Collection

In this paper, the minimum size of Big Data is taken to be a petabyte. Big data is available in many sectors, such as the web and social networking, machine-to-machine communication, large-scale transactions, biometric sensor data, human-generated data, the gaming industry, agriculture and education. We have collected healthcare data related to patients, i.e. name, email id, mobile number, SSN, gender, weight, etc. Figure 6 demonstrates sample records.

Fig. 6. Sample Dataset
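The published figure is not reproduced here; for orientation, records of this kind typically have the following shape (the header and all values below are invented, not taken from the actual dataset):

    name,patient_id,dob,email,gender,disease,ssn,mobile,weight
    Jane Roe,P0001,1984-03-12,[email protected],F,Diabetes,123-45-6789,9999900001,62
    John Doe,P0002,1979-11-02,[email protected],M,Asthma,987-65-4321,9999900002,78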


C. Hadoop Cluster and Map Reduce

Hadoop is a software framework that allows the processing of large datasets across large clusters of computers. The Hadoop Distributed File System (HDFS) is a Java-based distributed file system that can store all kinds of data without prior organization. MapReduce is a programming model for processing large datasets in parallel. A Hadoop cluster interconnects HDFS and MapReduce, so we can implement the program on the Hadoop cluster, as sketched below.
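As an illustration of how such a program might look (a minimal sketch under assumed column positions, not the authors' actual code), the following Hadoop mapper AES-encrypts selected fields of each CSV record:

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;
    import javax.crypto.Cipher;
    import javax.crypto.spec.SecretKeySpec;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class EncryptFieldsMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {

        // Column positions of sensitive fields (name, email, SSN) -- assumed layout
        private static final int[] SENSITIVE = {0, 3, 6};
        private Cipher cipher;

        @Override
        protected void setup(Context ctx) throws IOException {
            try {
                // Hard-coded 128-bit key for illustration only; a real deployment
                // would distribute the key securely rather than embed it in code
                byte[] key = "0123456789abcdef".getBytes(StandardCharsets.UTF_8);
                cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
                cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"));
            } catch (Exception e) {
                throw new IOException(e);
            }
        }

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(",", -1);
            try {
                for (int i : SENSITIVE) {
                    byte[] ct = cipher.doFinal(fields[i].getBytes(StandardCharsets.UTF_8));
                    fields[i] = Base64.getEncoder().encodeToString(ct); // CSV-safe output
                }
            } catch (Exception e) {
                throw new IOException(e);
            }
            ctx.write(NullWritable.get(), new Text(String.join(",", fields)));
        }
    }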

D. Experiment

Hadoop is an open-source framework, written in Java by the Apache Software Foundation, used to write software applications that process huge amounts of data. It works in parallel on large clusters, which may have thousands of computer nodes, and it processes data in a reliable and fault-tolerant manner. Hadoop can be installed on a Cloudera system; after the installation completes, the HDFS processes start automatically with their daemons. HDFS has two main node types. The Master Node, or NameNode, is the master of the system: it maintains the file system metadata and splits data across the slave nodes. The Data Nodes (slave nodes) provide the actual storage and are responsible for serving read and write requests from clients. MapReduce is an algorithm, or concept, for processing huge amounts of data quickly; as its name suggests, the work is divided between a Mapper and a Reducer [14]. Figure 7 demonstrates a Hadoop application.

Fig. 7. Hadoop applications and infrastructure interactions
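To connect the pieces, a job driver of roughly the following shape (a sketch; class and path names are assumptions matching the mapper sketched earlier) would configure and submit the map-only job:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class EncryptFieldsDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "encrypt-healthcare");
            job.setJarByClass(EncryptFieldsDriver.class);
            job.setMapperClass(EncryptFieldsMapper.class);
            job.setNumReduceTasks(0);            // map-only: mapper output is final
            job.setOutputKeyClass(NullWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }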

V. RESULT ANALYSIS

A. Preliminary Dataset Preparation

This work involves a dummy patient health dataset collected from the HSCIC (Health & Social Care Information Centre), containing the fields patient name, patient id, date of birth, email id, gender, disease and SSN. The data is maintained in CSV (Comma Separated Values) format [14].

B. Preliminary Data Analysis

The dataset is in CSV (Comma Separated Values) format. The results in this project are obtained by using the Advanced Encryption Standard (AES) algorithm to encrypt the plain text into encrypted data. Using a single-node system, we set the classpath for the Hadoop jar files. Figure 8 shows the sample dataset in HDFS, and Figure 9 shows the execution process.

Fig. 8. Health care_sample_Dataset1 (plain text)

Fig. 9. Result of added manifest
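The jar in Figure 9 can be produced with steps of the following shape (a sketch; file and jar names are assumptions):

    # Illustrative single-node build steps
    export HADOOP_CLASSPATH=$(hadoop classpath)
    javac -classpath "$HADOOP_CLASSPATH" -d classes \
        EncryptFieldsMapper.java EncryptFieldsDriver.java
    jar cvf encrypt-healthcare.jar -C classes .   # prints "added manifest" as in Fig. 9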

C. Hadoop Cluster Result

After placing the dataset into the Hadoop Distributed File System, we run the Hadoop job and finally obtain the output file; internally, the Mapper and Reducer processes are started. Figures 10 to 16 demonstrate the execution of the sample datasets and their results on the Hadoop cluster. The commands below sketch these steps.
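A sketch of the corresponding shell steps (all paths, file names and the driver class name are assumptions, not taken from the paper):

    hadoop fs -mkdir -p /user/cloudera/healthcare/input
    hadoop fs -put Health_care_sample_Dataset1.csv /user/cloudera/healthcare/input
    hadoop jar encrypt-healthcare.jar EncryptFieldsDriver \
        /user/cloudera/healthcare/input /user/cloudera/healthcare/output
    hadoop fs -cat /user/cloudera/healthcare/output/part-m-00000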


Fig. 10. Result of Mapper &Reducer initial stages

Fig. 11. Result of status of Mapper Reducer

Fig. 12. Result of Mapper &Reducer final stages

The NameNode status and the JobTracker can then be checked in the browser.

Fig. 13. Result of output job file details

Fig. 14. Result of Hadoop Counters

Fig. 15. Result of Mapper and Reducer completion graph


Fig. 16. Result of final decrypted output (encrypted text)

VI. CONCLUSION

Big data has the potential to improve healthcare delivery by shifting it from reporting results to predicting them at earlier stages. This paper described the 3 V's of Big Data and Hadoop, an open-source framework for managing data, and their role in healthcare. We initially considered a dataset containing fields such as id, name, gender, email id and diseases, analyzed the information received, and found it to be accurate. Thus we can obtain information that is both accurate and cost-effective.

References

[1] Praveen Kumar et al., "Efficient Capabilities of Processing of Big Data using Hadoop Map Reduce", International Journal of Advanced Research in Computer and Communication Engineering, vol. 3, no. 6, 2014.

[2] Emad A. Mohammed et al., "Applications of the MapReduce programming framework to clinical big data analysis: current landscape and future trends", Big Data Mining, 2014.

[3] Arantxa Duque Barrachina and Aisling O'Driscoll, "A Big Data methodology for categorizing technical support requests using Hadoop and Mahout", Journal of Big Data, 2014.

[4] K. Divya and N. Sadhasivam, "Secure Data Sharing in Cloud Environment Using Multi Authority Attribute Based Encryption", International Journal of Innovative Research in Computer and Communication Engineering, vol. 2, no. 1, 2014.

[5] Sabia and Sheetal Kalra, "Applications of Big Data: Current Status and Future Scope", International Journal of Computer Applications, vol. 3, no. 5, pp. 2319-2526, 2014.

[6] Hongsong Chen and Zhongchuan Fu, "Hadoop-Based Healthcare Information System Design and Wireless Security Communication Implementation", Mobile Information Systems, Hindawi Publishing Corporation, 2015.

[7] K. Shankar and P. Eswaran, "Sharing a secret image with encapsulated shares in visual cryptography", Procedia Computer Science, vol. 70, pp. 462-468, 2015.

[8] Priyanka K and Nagarathna Kulennavar, "A Survey on Big Data Analytics in Health Care", International Journal of Computer Science and Information Technologies, vol. 5, no. 4, pp. 5865-5868, 2014.

[9] Aditi Bansal, Ankita Deshpande, Priyanka Ghare, Seema Dhikale and Balaji Bodkhe, "Healthcare Data Analysis using Dynamic Slot Allocation in Hadoop", International Journal of Recent Technology and Engineering (IJRTE), vol. 3, no. 5, pp. 2277-3878, 2014.

[10] "Protecting Big Data: Protection Solutions for the Business Data Lake", technical review white paper, 2015.

[11] D. Peter Augustine, "Leveraging Big Data Analytics and Hadoop in Developing India's Health Care Services", International Journal of Computer Applications, vol. 89, no. 16, 2014.

[12] Muni Kumar N and Manjula R, "Role of Big Data Analytics in Rural Health Care – A Step Towards Svasth Bharath", International Journal of Computer Science and Information Technologies, vol. 5, no. 6, pp. 7172-7178, 2014.

[13] Harshawardhan S. Bhosale and Devendra P. Gadekar, "A Review Paper on Big Data and Hadoop", International Journal of Scientific and Research Publications, vol. 4, no. 10, October 2014.

[14] Dasari Madhav and B.V. Ramana, "De-identified Personal Health Care System Using Hadoop", International Journal of Electrical and Computer Engineering (IJECE), vol. 5, no. 6, pp. 1492-1499, December 2015.

[15] Lidong Wang and Cheryl Ann Alexander, "Medical Applications and Healthcare Based on Cloud Computing", International Journal of Cloud Computing and Services Science (IJ-CLOSER), vol. 2, no. 4, pp. 217-225, 2014.

[16] K. Shankar and P. Eswaran, "RGB based multiple share creation in visual cryptography with aid of elliptic curve cryptography", China Communications, vol. 14, no. 2, pp. 118-130, 2017.

[17] K. Shankar and P. Eswaran, "A secure visual secret share (VSS) creation scheme in visual cryptography using elliptic curve cryptography with optimization technique", Australian Journal of Basic & Applied Sciences, vol. 9, no. 36, pp. 150-163, 2015.

[18] K. Shankar and P. Eswaran, "RGB-Based Secure Share Creation in Visual Cryptography Using Optimal Elliptic Curve Cryptography Technique", Journal of Circuits, Systems and Computers, vol. 25, no. 11, p. 1650138, 2016.
