
Outlier Detection and Data Analytics on Individual Household Electric Power Consumption using Hadoop

Dhiraj Nitnaware

Electronics & Telecommunication Dep, Institute of Engineering & Technology (IET),

DAVV Indore (M.P)-India,

E-mail: [email protected].

Abstract

Big data analytics has become a well-established field in computer science. The large volumes of data to be analyzed come from cloud computing, sensor networks, Internet-based applications and social networking sites. Outlier analysis is an active topic in both research and application. In this paper, household power consumption is analyzed to obtain daily, weekly, monthly or annual observations, as selected by user input. Outlier detection is performed on a large volume of sensed data. A clustering approach, the DBSCAN algorithm, is applied to form clusters in the floating-point dataset. The solution has been implemented in Java on Hadoop in order to draw conclusions from the given data.

Keywords: Outlier Detection, Big Data Analysis, Hadoop Server, DBSCAN algorithm.


1. Introduction

Data is currently generated in very large volumes and at a rapidly increasing rate. Every second person now uses a cell phone, which in itself generates a large amount of data, and many other sources add to it. Big data analysis has thus become one of the most widely known technologies. Research suggests that analyzing such data can lead to new results in business, scientific disciplines and the public sector. There is a strong need for systems that improve on existing ones in terms of analysis, structure, scale, timeliness and privacy. These requirements call for a distributed architecture instead of a centralized one.

Big data research commonly uses Hadoop as an analysis tool. Hadoop is a well-known distributed ecosystem in which large amounts of data can be processed very efficiently. It includes a variety of modules such as MapReduce, Hive and Pig, and the Apache Software Foundation is responsible for maintaining it. A variety of technologies provide distributed file systems, and there are NoSQL databases such as MongoDB as well. Pig executes queries in the system, while Hive maintains the Hadoop data warehouse [4]. Sqoop handles data upload and download.

MapReduce is a framework built around two well-known functions, map and reduce, which are used to process large volumes of data. The Hadoop infrastructure runs on Linux, which reduces cost.

Journal of Information and Computational Science

Volume 9 Issue 9 - 2019

ISSN: 1548-7741

www.joics.org

For the distributed file system, HDFS can be used, which relies on Java-based technology. It stores data reliably and scalably and is highly fault tolerant. HDFS supports analysis of very large volumes of data with low processing time, and is therefore widely used. Many servers are kept available for computation so that failures can be handled. Any kind of data can be stored, whether schema-less, structured or unstructured.

Fig 1: Block Representation of Hadoop Components

Fig 2: Block Representation of Core Apache Hadoop Server

The purpose of this paper is to find a model to forecast electricity consumption in a household and to find the most suitable forecasting period: daily, weekly, monthly or yearly. The electric power consumption of one household was measured with a one-minute sampling rate over a period of almost 4 years. Different electrical quantities and some sub-metering values are available. The time series data used in this paper is individual household electric power consumption from December 2006 to November 2010. This work uses data from the UC Irvine Machine Learning Repository, a popular repository for machine learning datasets. In particular, we use the "Individual household electric power consumption Data Set", which is available in [8].

Attacks can be carried out effectively in a distributed environment, so considerable care is required. Security measures such as encryption and authentication have been applied to address this. In certain cases, big data may contain personal information about others. Block diagrams of the Hadoop components and of the core Hadoop server are given in figures 1 and 2.

2. Related Work

The authors of [5] calculated the electric consumption of a household in different categories: daily, weekly, monthly and quarterly. The analysis was performed on time series data of electric consumption for the period from December 2006 to November 2010. Two models were used: ARIMA (Autoregressive Integrated Moving Average) and ARMA (Autoregressive Moving Average). The most suitable forecasting period was chosen by taking the smallest values of AIC (Akaike Information Criterion) and RMSE (Root Mean Square Error), respectively.
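As an illustration of the two selection criteria, the following Python sketch computes RMSE from actual and predicted values and AIC from a model's log-likelihood and parameter count. It is not taken from [5]; the function names and the use of the standard form AIC = 2k - 2 ln(L) are assumptions about how the criteria would be evaluated.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def aic(log_likelihood, num_params):
    """Akaike Information Criterion in its standard form: 2k - 2 ln(L)."""
    return 2 * num_params - 2 * log_likelihood
```

Under both criteria a smaller value indicates the preferred model or forecasting period.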

The results reveal that ARIMA is the better of the two models for monthly and quarterly calculations, whereas ARMA is more suitable for daily and weekly observations. Both methods are suitable for short intervals of time. The time series process is shown in figure 3.

Figure 3: The framework of steps for building the forecasting time series models

Certain drawbacks are also observed. The data is not distributed, because composition and decomposition of the data are done in one shot. The method cannot be applied to a large dataset and cannot be run in a parallel computing environment. Hence, there is a need for an architecture that can process large datasets intelligently and analyze household data with the least possible overhead.

Usman Ali et al. [6] observe that communication between consumer and supplier can be improved by forming a Smart Grid in the electricity grid infrastructure. Smart meters are deployed that read the consumption profile of each customer, which results in better analysis. Consumption data should be properly analyzed in order to support planning and development. To obtain the best analysis, different methods are used, such as exploratory data analysis and preprocessing, frequent pattern mining and associations, classification, characterization, clustering and outlier detection. In that work, the study is performed on two publicly available datasets, and an evaluation is carried out to find which scheme is better.




Figure 4: Method proposed by Usman Ali et al. [6]

The authors use the Public Energy Disaggregation Research dataset and perform an hourly mapping on weekdays. A power consumption dataset for a single house is not considered in that study, so non-commercial analysis is not performed. The study focuses on knowledge extraction but ignores the noise present in the data, and redundancy is not examined either. This shows the demand for an intelligent model that processes and distributes data with minimum overhead.

Since the system focuses on data mining, and data mining at this scale is closely associated with cloud computing, a study of Hadoop and of clustering algorithms is needed. Some literature on cloud-based models and implementation details is mentioned below.

Konstantin Shvachko et al. [4] note that big data is attracting a great deal of attention at present. The tools and the cloud-based environment studied there are described as follows:

2.1 Hadoop

Hadoop is a highly reliable and scalable open-source software framework for distributed computing. It uses MapReduce as its distributed programming model. Various studies suggest applications of Hadoop.

2.2 HDFS

HDFS is the well-known file system of the Hadoop framework. It stores data on low-cost hardware. The strength of HDFS lies in its distributed architecture, as shown in figure 5.




Fig 5: HDFS Architecture

2.3 PIG

Another important component of the Hadoop ecosystem is Pig. It consists of two main components: the first is the language, called Pig Latin, and the second is the runtime environment in which programs written in Pig Latin are executed.

2.4 MapReduce

MapReduce is a framework for distributed processing on clusters. It is designed to process data in parallel in a distributed environment in a reliable and fault-tolerant manner.
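The map and reduce stages can be illustrated without a cluster. The following Python sketch simulates map, shuffle and reduce in a single process. It is an illustration only and does not use Hadoop's Java API; the function names and the sample records are hypothetical.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and stream out (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer once per key to the list of grouped values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def date_power_mapper(record):
    """Hypothetical mapper: emit the date as key and the power reading as value."""
    date, power = record
    yield date, power


def sum_reducer(key, values):
    """Hypothetical reducer: total of all readings that share one key."""
    return sum(values)


# Made-up sample records: (date, active power reading).
records = [("16/12/2006", 4.0), ("16/12/2006", 5.0), ("17/12/2006", 3.5)]
totals = reduce_phase(shuffle(map_phase(records, date_power_mapper)), sum_reducer)
```

In a real cluster the mappers and reducers run on different nodes and the shuffle moves data across the network; the data flow is the same.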

Chetan Dharni et al. [1] also work in the field of outlier analysis and study the DBSCAN algorithm. In this algorithm, a point is termed a core point if at least a minimum number of points (MinPts) lie within the distance Eps of it, as measured by the distance function. A point that is not a core point but lies within Eps of a core point is a border point. All points that are neither core nor border are noise points. The steps of the algorithm are given below:

• Label all points as core, border or noise points

• Eliminate noise points

• Put an edge between all core points that are within Eps of each other

• Make each group of connected core points into a separate cluster

• Assign each border point to one of the clusters of its associated core points.
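The steps above can be sketched as follows. This is a minimal, quadratic-time Python illustration under the usual DBSCAN definitions, not the implementation studied in [1]. The names dbscan, eps and min_pts are chosen for this sketch, and noise points are kept with the label -1 rather than deleted.

```python
import math


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    n = len(points)
    # Eps-neighbourhood of every point (each point counts as its own neighbour).
    neighbours = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    # Step 1: a point is core if its neighbourhood holds at least min_pts points.
    core = [len(neighbours[i]) >= min_pts for i in range(n)]

    # Step 2: every point starts as noise; points never claimed stay at -1.
    labels = [-1] * n
    cluster_id = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        # Steps 3 and 4: follow links between core points within eps of each other.
        labels[i] = cluster_id
        stack = [i]
        while stack:
            p = stack.pop()
            for q in neighbours[p]:
                if labels[q] == -1:
                    # Step 5: a border point joins the cluster but is not expanded.
                    labels[q] = cluster_id
                    if core[q]:
                        stack.append(q)
        cluster_id += 1
    return labels


# Made-up one-dimensional readings, written as 1-tuples for math.dist.
sample = [(1.0,), (1.1,), (1.2,), (5.0,), (5.1,), (5.2,), (9.0,)]
sample_labels = dbscan(sample, eps=0.15, min_pts=2)
```

With these sample values the first three points form one cluster, the next three form a second, and the isolated reading 9.0 is left as noise.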




Fig 6: Arbitrary Shape cluster of DBSCAN

Fig 7: Border and Core Representation of DBSCAN

3. Proposed Method

Researchers have observed that big data and cloud computing are closely related fields, largely because of frameworks such as MapReduce. As mentioned above, detecting and analyzing outliers is a difficult task. Researchers adopt concepts from diverse disciplines such as statistics, machine learning, data mining, information theory and spectral theory, and apply them to specific problem formulations. The complete solution has been implemented on the Hadoop framework with the help of the MapReduce component and the Hadoop Distributed File System. The following steps are proposed for outlier detection and for developing other analytical use cases on individual household power consumption.

The proposed solution considers a dataset of individual household power consumption data for one main meter and three sub-meters. The leading attributes of the dataset are listed below:

• date: date in format dd/mm/yyyy

• time: time in format hh:mm:ss

• global_active_power: household global minute-averaged active power (in kilowatt)

• global_reactive_power: household global minute-averaged reactive power (in kilowatt)

• voltage: minute-averaged voltage (in volt)

• global_intensity: household global minute-averaged current intensity (in ampere)

• sub_metering_1: energy sub-metering No. 1 (in watt-hour of active energy). It corresponds to the kitchen, containing mainly a dishwasher, an oven and a microwave (hot plates are not electric but gas powered).

• sub_metering_2: energy sub-metering No. 2 (in watt-hour of active energy). It corresponds to the laundry room, containing a washing machine, a tumble-drier, a refrigerator and a light.

• sub_metering_3: energy sub-metering No. 3 (in watt-hour of active energy). It corresponds to an electric water-heater and an air-conditioner.
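To make the attribute list concrete, the sketch below parses one record into a timestamp and a dictionary of numeric fields. It assumes a semicolon-separated text layout with "?" marking missing values, which this paper does not specify. The sample line and the names FIELDS and parse_record are illustrative only.

```python
from datetime import datetime

# Numeric attributes in the order in which they are listed in the text.
FIELDS = (
    "global_active_power",
    "global_reactive_power",
    "voltage",
    "global_intensity",
    "sub_metering_1",
    "sub_metering_2",
    "sub_metering_3",
)


def parse_record(line, sep=";", missing="?"):
    """Parse one text record into (timestamp, {attribute: float or None})."""
    parts = line.strip().split(sep)
    if len(parts) != 2 + len(FIELDS):
        raise ValueError(f"expected {2 + len(FIELDS)} fields, got {len(parts)}")
    # Date is dd/mm/yyyy and time is hh:mm:ss, as stated in the attribute list.
    timestamp = datetime.strptime(parts[0] + " " + parts[1], "%d/%m/%Y %H:%M:%S")
    values = {
        name: None if raw in ("", missing) else float(raw)
        for name, raw in zip(FIELDS, parts[2:])
    }
    return timestamp, values


# Illustrative record in the assumed layout.
sample_line = "16/12/2006;17:24:00;4.216;0.418;234.840;18.400;0.000;1.000;17.000"
timestamp, values = parse_record(sample_line)
```

Keeping missing readings as None, rather than zero, lets later stages skip them instead of treating them as genuine low consumption.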

A Mapper class, keyed by attribute name, has been developed to split the complete data into multiple parts. DBSCAN is a widely used algorithm for floating-point values; here it takes a value of the selected attribute as the core point and a range as the epsilon value when forming groups.

The DBSCAN clustering approach is used to process the data in parallel and to form groups within the dataset. Epsilon defines the range of values that decides whether a value of the selected attribute belongs to a cluster or not. A value that falls outside this range is considered an outlier and is not included in any cluster.
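The range check described above can be sketched for a single attribute as follows. This is a simplified Python illustration of the epsilon test as this section describes it (one chosen core value and a range). It is not a full DBSCAN and not the Java Mapper used in this work; the function names and the sample voltages are hypothetical.

```python
def emit_attribute_pairs(row):
    """Mapper step: emit (attribute name, value) pairs, skipping missing values."""
    for name, value in row.items():
        if value is not None:
            yield name, value


def split_by_epsilon(values, core, eps):
    """Separate values within eps of the core value from the outliers."""
    members = [v for v in values if abs(v - core) <= eps]
    outliers = [v for v in values if abs(v - core) > eps]
    return members, outliers


# Made-up rows; None stands for a missing reading.
rows = [
    {"voltage": 234.0},
    {"voltage": 235.5},
    {"voltage": None},
    {"voltage": 250.0},
    {"voltage": 233.2},
]

# Group the emitted values by attribute name, as the shuffle stage would.
grouped = {}
for row in rows:
    for name, value in emit_attribute_pairs(row):
        grouped.setdefault(name, []).append(value)

members, outliers = split_by_epsilon(grouped["voltage"], core=234.0, eps=2.0)
```

Because each attribute is grouped under its own key, the check for one attribute is independent of the others and can run in parallel.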

Daily, weekly, monthly, quarterly and half-yearly consumption and other use cases will be developed for analysis purposes. The algorithm is implemented in Java, which handles the input and output of the required details. For performance analysis, the computation time is measured. The Hadoop framework is used, and the findings are compared on a distributed environment of three nodes. A block representation of the proposed solution is shown in figure 8.

Fig 8: Proposed Solution
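The periodic use cases amount to grouping readings by a time key and aggregating each group. The Python sketch below shows one way to derive such keys and compute a per-period mean. The key formats, the choice of the mean as the aggregate and the function names are assumptions, since the text does not define them.

```python
from collections import defaultdict
from datetime import datetime


def period_key(ts, period):
    """Map a timestamp to the label of the period it falls into."""
    if period == "daily":
        return ts.strftime("%Y-%m-%d")
    if period == "weekly":
        iso = ts.isocalendar()  # (ISO year, ISO week number, ISO weekday)
        return f"{iso[0]}-W{iso[1]:02d}"
    if period == "monthly":
        return ts.strftime("%Y-%m")
    if period == "quarterly":
        return f"{ts.year}-Q{(ts.month - 1) // 3 + 1}"
    if period == "yearly":
        return str(ts.year)
    raise ValueError(f"unknown period: {period}")


def mean_by_period(readings, period):
    """Average the values of (timestamp, value) readings within each period."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for ts, value in readings:
        key = period_key(ts, period)
        sums[key] += value
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Made-up readings: (timestamp, sub-meter value).
readings = [
    (datetime(2006, 12, 16, 17, 24), 4.0),
    (datetime(2006, 12, 17, 8, 0), 2.0),
    (datetime(2007, 1, 1, 0, 0), 1.0),
]
monthly = mean_by_period(readings, "monthly")
```

In a MapReduce setting the period key would be emitted by the mapper and the averaging done in the reducer.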




4. Results and Analysis

The complete work has been analyzed on an hourly, weekly, monthly and yearly basis.

4.1 Hourly Analysis

Fig 9: Hourly Analysis of Sub-metering 1

Fig 10: Hourly Analysis of Sub-metering 2

Figure 11: Hourly Analysis of Sub-metering 3




The analysis of sub-meter 1 shows that electricity consumption in the kitchen is low during the early hours, rises through the day and reaches a maximum. For sub-meter 2, consumption starts high, drops sharply to its minimum between hours 2 and 5, and then rises through the day. For sub-meter 3, stable values are recorded, as shown in figures 9-11. These observations are plausible because they match the routine pattern of kitchen work and its power requirement.

4.2 Weekly Analysis

Fig 12: Weekly Analysis of Sub-metering 1.

Fig 13: Weekly Analysis of Sub-metering 2.

Fig 14: Weekly Analysis of Sub-metering 3.




For sub-meters 1 and 2, values that are saturated but confined to a limited range are recorded throughout the week. The values of sub-meter 3 are stable throughout, as shown in figures 12-14.

4.3 Monthly Analysis

For sub-meters 1, 2 and 3, saturated values are recorded for all months, as shown in figures 15-17.

Fig 15: Monthly Analysis of Sub-metering 1.

Fig 16: Monthly Analysis of Sub-metering 2.

Fig 17: Monthly Analysis of Sub-metering 3




4.4 Yearly Analysis

The year-wise analysis of global active power, reactive power and intensity shows that, after a warm start, values saturate at their maximum in the middle years and are slightly lower in the most recent years. In addition, a slight rise in values is observed for sub-meters 1 and 2. For sub-meter 3, a sudden rise is observed between 2007 and 2008, after which the values remain stable at their maximum from 2008 to 2010, as shown in figures 18-20.

Fig 18: Yearly Analysis of Sub-metering 1.






5. Conclusion And Future Scope

The concluding remarks help to assess the significance of the work on the proposed platform. Power consumption analysis has been performed on sub-meter data from Clamart city, France, on a daily, weekly, monthly and yearly basis. The analysis of sub-meters 1 and 2 shows heavy but stable power consumption between 7 a.m. and midnight, which falls during the night. For sub-meter 3, stable values are observed, with an hourly rise in power consumption in recent years relative to earlier years.

Heavy fluctuation is observed in the yearly graph, which reflects variable power consumption over the year. The proposed solution attempts to remove outliers and noise from the existing dataset and thereby improve the accuracy of the analytics. Overall, heavy power consumption is observed in the morning, in the evening and around midnight.

References

6.1 Journal Article

[1] Chetan Dharni, Meenakshi Bansal, "Survey on Improved DBSCAN Algorithm",

International Journal of Computer Science and Technology Volume 4, Issue 2, April

- June (2013).

[2] Balaji K. Bodkhe and Dr. Sanjay P. Sood, "Analysis of Smart Meter Data Using

Hadoop", International Journal of Current Engineering and Scientific Research

(IJCESR) VOLUME-2, ISSUE-9, (2015).

[3] Tabhane, Samrudhi, and R. A. Fadnavis, "Large data computing using Clustering

algorithms based on Hadoop", International Journal of Engineering Research and

General Science Volume 3, Issue 2, March-April, (2015).

6.2 Conference Proceedings

[4] Shvachko, Konstantin, Hairong Kuang, Sanjay Radia, and Robert Chansler, "The

Hadoop distributed file system." In Mass storage systems and technologies (MSST),

2010 IEEE 26th symposium on, pp. 1-10. IEEE, (2010).

[5] Chujai, Pasapitch, Nittaya Kerdprasop, and Kittisak Kerdprasop, "Time series

analysis of household electric consumption with ARIMA and ARMA models", In

Proceedings of the International Multiconference of Engineers and Computer

Scientists, vol. 1, pp. 295-300. (2013).

[6] Ali, Usman, Concettina Buccella, and Carlo Cecati, "Households electricity

consumption analysis with data mining techniques," In Industrial Electronics Society,

IECON 2016-42nd Annual Conference of the IEEE, pp. 3966-3971. IEEE, (2016).

[7] Sauhats, Antans, Renata Varfolomejeva, Olegs Lmkevics, Romans Petrecenko, Maris

Kunickis, and Mans Balodis, "Analysis and prediction of electricity consumption

using smart meter data", In Power Engineering, Energy and Electrical Drives

(POWERENG), 2015 IEEE 5th International Conference on, pp. 17-22. IEEE, (2015).

[8] www.archive.ics.uci.edu.


