Outlier Detection and Data Analytics on Individual
Household Electric Power Consumption using Hadoop
Dhiraj Nitnaware
Electronics & Telecommunication Dept., Institute of Engineering & Technology (IET),
DAVV Indore (M.P)-India,
E-mail: [email protected].
Abstract
Big data analytics has become a well-known field in computer science. Such large
volumes of data arise from cloud computing, sensor networks, Internet-based
applications, and social networking sites. Outlier analysis is being considered in both
research and application. In this paper, household power consumption is analyzed in
order to obtain daily, weekly, monthly, or annual observations from the user input.
Outlier detection is performed on a large volume of sensed data. A clustering approach,
the DBSCAN algorithm, is applied in order to form clusters in a floating-point dataset.
The solution has been implemented using Java technology along with Hadoop in order
to draw conclusions from the given data.
Keywords: Outlier Detection, Big Data Analysis, Hadoop Server, DBSCAN Algorithm,
Household Electric Power.
1. Introduction
In the current scenario, data is generated in very large volumes and grows at a rapid
rate. Every second person uses a cell phone, which generates a large amount of data.
This data is created from various sources. Big data analysis has thus become one of the
most widely known technologies. Research conveys that findings from such data can
lead to many new results in business, scientific disciplines, and the public sector. There
is a strong desire to develop systems that are better in terms of analysis, structure,
scale, timeliness, and privacy. All these needs lead to systems with a distributed rather
than a centralized architecture.
Big data research involves Hadoop as one of its analysis tools. Hadoop is a well-known
distributed ecosystem in which large amounts of data can be processed very
efficiently. It offers a variety of modules such as MapReduce, Hive, and Pig. The Apache
Software Foundation is responsible for maintaining the environment. A variety of
technologies provide distributed file systems, along with a variety of NoSQL databases
such as MongoDB. Pig executes queries in the system, while Hive maintains the data
warehouse of Hadoop [4]. Sqoop performs data uploading and downloading.
MapReduce is a framework comprising two well-known functions, Map and Reduce,
that are used to process a large volume of data. The Hadoop infrastructure runs on
Linux, which results in reduced cost.
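As a minimal sketch of the Map and Reduce functions described above, the following plain-Java example (with no Hadoop dependency) maps each input line to a (key, value) pair, groups the pairs by key, and reduces each group by summation. The class name MapReduceSketch, the "key;value" line layout, and the choice of summation are illustrative assumptions, not the implementation used in this paper.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory illustration of the Map and Reduce steps.
class MapReduceSketch {
    // Map step: turn one "key;value" line into a (key, value) pair.
    static Map.Entry<String, Double> map(String line) {
        String[] parts = line.split(";");
        return new SimpleEntry<>(parts[0], Double.parseDouble(parts[1]));
    }

    // Reduce step: combine all values that share a key (here: sum them).
    static double reduce(List<Double> values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum;
    }

    // Driver: map every line, group the pairs by key, then reduce each group.
    static Map<String, Double> run(List<String> lines) {
        Map<String, List<Double>> grouped = new TreeMap<>();
        for (String line : lines) {
            Map.Entry<String, Double> pair = map(line);
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        Map<String, Double> result = new TreeMap<>();
        for (Map.Entry<String, List<Double>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }
}
```

In a real Hadoop job the grouping step is carried out by the framework between the Map and Reduce phases; here it is a local map only so that the sketch stays self-contained.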
Journal of Information and Computational Science
Volume 9 Issue 9 - 2019
ISSN: 1548-7741
www.joics.org
For the distributed file system, HDFS can be used, which relies on Java-based
technology. It stores data reliably and at scale. HDFS is highly fault tolerant and
supports analysis of very large volumes of data with low processing time, which makes
it widely used and desirable. Computation is spread over many servers so that failures
can be handled. Any kind of data can be stored, whether schema-less, structured, or
unstructured.
Fig 1: Block Representation of Hadoop Components
Fig 2: Block Representation of Core Apache Hadoop Server
The purpose of this paper is to find a model to forecast the electricity consumption of
a household and to find the most suitable forecasting period, whether daily, weekly,
monthly, or yearly. The electric power consumption of one household was measured at
a one-minute sampling rate over a period of almost 4 years. Different electrical
quantities and some sub-metering values are available. The time series data in this
paper is the individual household electric power consumption from December 2006 to
November 2010. This work uses data from the UC Irvine Machine Learning Repository,
a popular repository for machine learning datasets. In particular, we use the
"Individual household electric power consumption Data Set", which is available in [8].
Such analysis can be performed effectively in a distributed environment, hence
considerable care is required. Security approaches such as encryption and
authentication have been applied to it. In certain cases, big data can contain personal
information about others. Block diagrams of the Hadoop components and the Hadoop
server are given in Figures 1 and 2.
2. Related Work
The electric consumption found in household activity is calculated in different
categories (daily, weekly, monthly, and quarterly) by the authors of [5]. The analysis is
performed on time series data of electric consumption for the period from December
2006 to November 2010. Two models were used: ARIMA (Autoregressive Integrated
Moving Average) and ARMA (Autoregressive Moving Average). The most suitable
forecasting period was chosen by taking the smallest values of AIC (Akaike Information
Criterion) and RMSE (Root Mean Square Error).
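The two selection criteria named above can be sketched in a few lines of Java. RMSE is the square root of the mean squared forecast error, and AIC is computed from the standard definition 2k - 2 ln(L), where k is the number of fitted parameters and ln(L) is the log-likelihood of the fitted model. The class and method names are illustrative assumptions; the code used in [5] is not given in this paper.

```java
// Hypothetical helper for the model-selection criteria; not taken from [5].
class ForecastMetrics {
    // Root Mean Square Error between observed and forecast values.
    static double rmse(double[] actual, double[] predicted) {
        if (actual.length != predicted.length || actual.length == 0) {
            throw new IllegalArgumentException(
                "arrays must be non-empty and of equal length");
        }
        double sumSq = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double err = actual[i] - predicted[i];
            sumSq += err * err;
        }
        return Math.sqrt(sumSq / actual.length);
    }

    // Akaike Information Criterion: 2k - 2 ln(L), where k is the number of
    // fitted parameters and logLikelihood is ln(L) of the fitted model.
    static double aic(int k, double logLikelihood) {
        return 2.0 * k - 2.0 * logLikelihood;
    }
}
```

For both criteria a smaller value indicates a better candidate, which is why the smallest value is used for selection.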
The results reveal that, of the two models, ARIMA is better for monthly and quarterly
calculations, whereas ARMA is more suitable for daily and weekly observations. Both
methods are suitable for a short interval of time. The time series process is shown in
Figure 3.
Figure 3: The framework of steps for building the forecasting time series models
Certain drawbacks are also observed: the data is not distributed, since it is composed
and decomposed in one shot. The method cannot be applied to a large dataset and
cannot run in a parallel computing environment. Hence, there is a need for an
architecture that can process large datasets intelligently and analyze household data
with the least possible overhead.
Usman Ali et al. [6] study how communication between consumer and supplier can be
improved by forming a Smart Grid in the electricity grid infrastructure. Smart meters
are introduced that read the profile of each customer, which results in even better
analysis. Consumption data should be properly analyzed in order to support proper
planning and development. To obtain the best analysis, different methods are used,
such as exploratory data analysis and preprocessing, frequent pattern mining and
associations, classification, characterization, clustering, and outlier detection. In that
work, the study is performed on two publicly available datasets. The evaluation is done
in order to find which scheme is better.
Figure 4: Method proposed by Usman Ali et al. [6]
The authors take the Public Energy Disaggregation Research dataset and perform an
hourly mapping on weekdays. A power consumption database for a single house is not
considered in that study, so noncommercial analysis is not performed. The work focuses
on knowledge extraction but ignores the noise present in the data, and redundancy is
not examined. This shows the demand for an intelligent model that processes and
distributes data with minimum overhead.
As the system focuses on data mining, and data mining is often associated with cloud
computing, a study of Hadoop and of clustering algorithms is desirable. Some literature
on the cloud-based model and its implementation details is mentioned below.
Konstantin Shvachko et al. [4] note that big data is attracting attention to a very large
extent in the current scenario. The various tools needed and the cloud-based
environment are described as follows:
2.1 Hadoop
Hadoop is a highly reliable and scalable model that is based on distributed computing
and is open-source software. The framework used in it is MapReduce, along with a
distributed programming model. Various research suggests applications of Hadoop.
2.2 HDFS
HDFS is the well-known file system of the Hadoop framework, which stores data on
low-cost hardware. The strength of HDFS lies in the fact that it works in a distributed
architecture, as shown in Figure 5.
Fig 5: HDFS Architecture
2.3 PIG
Another important component of the Hadoop ecosystem is Pig. It consists of two main
components: the first is the language, called Pig Latin, and the second is the runtime
environment in which programs written in Pig Latin are executed.
2.4 MapReduce
MapReduce is the framework for distributed processing on clusters. The logic behind
this framework is to process data in parallel in a distributed environment in a reliable
and fault-tolerant manner.
Chetan Dharni et al. [1] also work in the field of outlier analysis and study the
DBSCAN algorithm. In this algorithm, a point is termed a core point if at least a
minimum number of points (MinPts) lie within the distance Eps of it, as measured by
the distance function. Points that are not core points themselves but lie within Eps of a
core point are border points. All points that are neither core nor border points are
noise points. The steps of the algorithm are given below:
• Label all points as core, border, or noise points.
• Eliminate noise points.
• Put an edge between all core points that are within Eps of each other.
• Make each group of connected core points into a separate cluster.
• Assign each border point to one of the clusters of its associated core points.
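The steps above can be sketched for one-dimensional floating-point values as follows. This is a minimal illustration, assuming the absolute difference as the distance function and the usual seed-expansion formulation of DBSCAN, which gives the same core, border, and noise roles as the listed steps. The class name Dbscan1D and the minPts parameter (counted including the point itself) are assumptions; they are not taken from [1] or from the implementation in this paper.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical one-dimensional DBSCAN; distance is |a - b|.
class Dbscan1D {
    static final int NOISE = -1;
    private static final int UNVISITED = 0;

    // Indices of all points within eps of points[i], including i itself.
    static List<Integer> neighbors(double[] points, int i, double eps) {
        List<Integer> result = new ArrayList<>();
        for (int j = 0; j < points.length; j++) {
            if (Math.abs(points[i] - points[j]) <= eps) {
                result.add(j);
            }
        }
        return result;
    }

    // Returns one label per point: a cluster id (1, 2, ...) or NOISE.
    static int[] cluster(double[] points, double eps, int minPts) {
        int[] labels = new int[points.length]; // every entry starts UNVISITED
        int clusterId = 0;
        for (int i = 0; i < points.length; i++) {
            if (labels[i] != UNVISITED) {
                continue;
            }
            List<Integer> seeds = neighbors(points, i, eps);
            if (seeds.size() < minPts) {
                labels[i] = NOISE; // not core; may be relabeled as border later
                continue;
            }
            clusterId++; // points[i] is a core point and starts a new cluster
            labels[i] = clusterId;
            Deque<Integer> queue = new ArrayDeque<>(seeds);
            while (!queue.isEmpty()) {
                int q = queue.poll();
                if (labels[q] == NOISE) {
                    labels[q] = clusterId; // border point reached from a core
                }
                if (labels[q] != UNVISITED) {
                    continue;
                }
                labels[q] = clusterId;
                List<Integer> qNeighbors = neighbors(points, q, eps);
                if (qNeighbors.size() >= minPts) {
                    queue.addAll(qNeighbors); // q is core: keep expanding
                }
            }
        }
        return labels;
    }
}
```

Points that keep the NOISE label after the loop are the outliers; all other points carry the id of the cluster they were assigned to.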
Fig 6: Arbitrary Shape cluster of DBSCAN
Fig 7: Border and Core Representation of DBSCAN
3. Proposed Method
Researchers have observed that big data and cloud computing are highly correlated
fields. A major contribution to this comes from frameworks such as MapReduce. As
mentioned above, detecting and analyzing outliers is a difficult task. Researchers adopt
concepts from diverse disciplines such as statistics, machine learning, data mining,
information theory, and spectral theory, and apply them to specific problem formulations.
The complete solution has been implemented in the Hadoop framework with the help of
the MapReduce component and the Hadoop Distributed File System. The following steps
are proposed for outlier detection and the development of other analytical use cases for
individual household power consumption.
The proposed solution considers a dataset of individual household power consumption
with one main meter and three sub-meters. The leading attributes of the dataset are
listed below:
• date: Date in format dd/mm/yyyy
• time: time in format hh:mm:ss
• global_active_power: household global minute-averaged active power (in
kilowatt)
• global_reactive_power: household global minute-averaged reactive power (in
kilowatt)
• voltage: minute-averaged voltage (in volt)
• global_intensity: household global minute-averaged current intensity (in ampere)
• sub_metering_1: energy sub-metering No. 1 (in watt-hour of active energy). It
corresponds to the kitchen, containing mainly a dishwasher, an oven and a
microwave (hot plates are not electric but gas powered).
• sub_metering_2: energy sub-metering No. 2 (in watt-hour of active energy). It
corresponds to the laundry room, containing a washing machine, a tumble-drier, a
refrigerator and a light.
• sub_metering_3: energy sub-metering No. 3 (in watt-hour of active energy). It
corresponds to an electric water-heater and an air-conditioner.
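To make the attribute list above concrete, the following minimal sketch parses one record into a typed Java object. It assumes that each line holds the nine attributes in the listed order, separated by semicolons, and that a missing measurement is marked with "?"; the class name PowerRecord and its field names are hypothetical and are not taken from the implementation in this paper.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;

// Hypothetical record type for one line of the household power dataset.
class PowerRecord {
    // Assumed layout: date;time;seven numeric fields, "?" for a missing value.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("d/M/yyyy H:mm:ss");

    final LocalDateTime timestamp;
    final double globalActivePower;   // kilowatt
    final double globalReactivePower; // kilowatt, as listed above
    final double voltage;             // volt
    final double globalIntensity;     // ampere
    final double subMetering1;        // watt-hour, kitchen
    final double subMetering2;        // watt-hour, laundry room
    final double subMetering3;        // watt-hour, water-heater and air-conditioner

    private PowerRecord(LocalDateTime timestamp, double[] v) {
        this.timestamp = timestamp;
        this.globalActivePower = v[0];
        this.globalReactivePower = v[1];
        this.voltage = v[2];
        this.globalIntensity = v[3];
        this.subMetering1 = v[4];
        this.subMetering2 = v[5];
        this.subMetering3 = v[6];
    }

    // Returns an empty Optional for header lines, malformed lines,
    // and lines with missing values.
    static Optional<PowerRecord> parse(String line) {
        String[] f = line.trim().split(";");
        if (f.length != 9) {
            return Optional.empty();
        }
        try {
            double[] v = new double[7];
            for (int i = 0; i < 7; i++) {
                // "?" or a column name throws NumberFormatException here
                v[i] = Double.parseDouble(f[i + 2]);
            }
            LocalDateTime ts = LocalDateTime.parse(f[0] + " " + f[1], FORMAT);
            return Optional.of(new PowerRecord(ts, v));
        } catch (NumberFormatException | DateTimeParseException e) {
            return Optional.empty();
        }
    }
}
```

Dropping incomplete lines at parse time is one possible design choice; a different treatment of missing values (for example interpolation) would need a different return type.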
A Mapper class that uses the name as key and the attribute as value has been developed
to split the complete data into multiple parts. DBSCAN is a widely used algorithm for
floating-point values; it takes any one attribute as the core point and a range as the
epsilon value for forming groups.
The DBSCAN clustering approach is used to process the data in parallel and to form
groups within the dataset. Epsilon defines the range of values that decides whether or
not a value of the selected attribute belongs to a cluster. A value that does not fall
within this range is considered an outlier and is not included in any cluster.
Daily, weekly, monthly, quarterly, and half-yearly consumption and other use cases
are developed for analysis. The algorithm is implemented in Java, which handles the
input and output of the desired details. For performance analysis, the computation
time is measured. The Hadoop framework is used, and the findings are compared on a
distributed environment of three nodes. A block representation of the proposed
solution is shown in Figure 8.
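The periodic use cases named above can be illustrated by deriving a grouping key from each reading's date and summing the readings per key. The sketch below is an assumption about how such keys might be built with java.time; the class name PeriodAggregator, the key formats, and the use of ISO week numbering are illustrative choices, not details given in this paper.

```java
import java.time.LocalDate;
import java.time.temporal.IsoFields;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper that groups readings by calendar period.
class PeriodAggregator {
    enum Period { DAILY, WEEKLY, MONTHLY, QUARTERLY, HALF_YEARLY }

    // Builds a sortable grouping key such as "2007-03" or "2007-Q1".
    static String key(LocalDate date, Period period) {
        int year = date.getYear();
        switch (period) {
            case DAILY:
                return date.toString(); // ISO format yyyy-MM-dd
            case WEEKLY:
                return String.format(Locale.ROOT, "%d-W%02d",
                        date.get(IsoFields.WEEK_BASED_YEAR),
                        date.get(IsoFields.WEEK_OF_WEEK_BASED_YEAR));
            case MONTHLY:
                return String.format(Locale.ROOT, "%d-%02d",
                        year, date.getMonthValue());
            case QUARTERLY:
                return year + "-Q" + date.get(IsoFields.QUARTER_OF_YEAR);
            case HALF_YEARLY:
                return year + "-H" + (date.getMonthValue() <= 6 ? 1 : 2);
            default:
                throw new IllegalArgumentException("unknown period");
        }
    }

    // Sums the readings (for example watt-hours) that fall into each period.
    static Map<String, Double> total(Map<LocalDate, Double> readings, Period period) {
        Map<String, Double> totals = new TreeMap<>();
        readings.forEach((date, value) ->
                totals.merge(key(date, period), value, Double::sum));
        return totals;
    }
}
```

In a MapReduce setting the key function would play the role of the Map output key and the summation the role of the Reduce step.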
Fig 8: Proposed Solution
4. Results and Analysis
The complete work has been analyzed on an hourly, weekly, monthly, and yearly
basis.
4.1 Hourly Analysis
Fig 9: Hourly Analysis of Sub-metering 1
Fig 10: Hourly Analysis of Sub-metering 2
Figure 11: Hourly Analysis of Sub-metering 3
The analysis of sub-meter 1 shows that electricity consumption in the kitchen is low
during the early hours, rises through the day, and reaches its maximum value. For
sub-meter 2, consumption starts high, drops abruptly to its minimum between hours 2
and 5, and then rises through the day. For sub-meter 3, stable results are recorded, as
shown in Figures 9-11. This analysis is plausible because it is consistent with routine
kitchen use and its power requirement.
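An hourly profile of this kind can be obtained by averaging all readings that share the same hour of the day. The following minimal sketch shows one way to compute it; the class name HourlyProfile and the use of the mean (rather than the sum) are assumptions, since the paper does not state how the hourly values in the figures were aggregated.

```java
import java.time.LocalDateTime;
import java.util.List;

// Hypothetical helper that builds a 24-value hour-of-day profile.
class HourlyProfile {
    // Mean reading for each hour of the day (index 0..23);
    // NaN marks an hour for which no reading is available.
    static double[] meanByHour(List<LocalDateTime> times, List<Double> values) {
        if (times.size() != values.size()) {
            throw new IllegalArgumentException("lists must have equal length");
        }
        double[] sum = new double[24];
        int[] count = new int[24];
        for (int i = 0; i < times.size(); i++) {
            int hour = times.get(i).getHour();
            sum[hour] += values.get(i);
            count[hour]++;
        }
        double[] mean = new double[24];
        for (int h = 0; h < 24; h++) {
            mean[h] = count[h] == 0 ? Double.NaN : sum[h] / count[h];
        }
        return mean;
    }
}
```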
4.2 Weekly Analysis
Fig 12: Weekly Analysis of Sub-metering 1.
Fig 13: Weekly Analysis of Sub-metering 2.
Fig 14: Weekly Analysis of Sub-metering 3.
For sub-meters 1 and 2, the values are saturated but stay within a limited range
across all weeks. The values of sub-meter 3 are stable throughout, as shown in
Figures 12-14.
4.3 Monthly Analysis
For sub-meters 1, 2, and 3, saturated values are recorded for all months, as shown in
Figures 15-17.
Fig 15: Monthly Analysis of Sub-metering 1.
Fig 16: Monthly Analysis of Sub-metering 2.
Fig 17: Monthly Analysis of Sub-metering 3
4.4 Yearly Analysis
The year-wise analysis of global active power, reactive power, and intensity shows
that, after a warm start, the values saturated at their maximum in the middle years and
were slightly lower in the most recent years. In addition, a slight rise in values is
observed for sub-meters 1 and 2. For sub-meter 3, a sudden rise is observed between
2007 and 2008, after which the values remain stable at their maximum from 2008 to
2010, as shown in Figures 18-20.
Fig 18: Yearly Analysis of Sub-metering 1.
Fig 19: Yearly Analysis of Sub-metering 2.
Fig 20: Yearly Analysis of Sub-metering 3.
5. Conclusion and Future Scope
The concluding remarks help to evaluate the intensity and significance of the work for
the proposed platform. Power consumption analysis has been performed on the
sub-meter data of Clamart city, France, on a daily, weekly, monthly, and yearly basis.
The analysis of sub-meters 1 and 2 shows that heavy but stable power consumption
occurs between 7 o'clock in the morning and 12 o'clock midnight, and that it falls
during the night. For sub-meter 3, stable results are observed, with an hourly rise in
power consumption in recent years with respect to the previous decade.
Heavy fluctuation is observed in the yearly graph, which reflects the variable power
consumption over the whole year. The proposed solution attempts to remove outliers
and noise from the existing dataset and thereby enhance the accuracy of the analytics.
The overall conclusion is that heavy power consumption is observed in the morning,
evening, and midnight hours.
References
6.1 Journal Article
[1] Chetan Dharni, Meenakshi Bansal, "Survey on Improved DBSCAN Algorithm",
International Journal of Computer Science and Technology Volume 4, Issue 2, April
- June (2013).
[2] Balaji K. Bodkhe and Dr. Sanjay P. Sood, "Analysis of Smart Meter Data Using
Hadoop", International Journal of Current Engineering and Scientific Research
(IJCESR) Volume 2, Issue 9, (2015).
[3] Tabhane, Samrudhi, and R. A. Fadnavis, "Large data computing using Clustering
algorithms based on Hadoop", International Journal of Engineering Research and
General Science Volume 3, Issue 2, March-April, (2015).
6.2. Conference Proceedings
[4] Shvachko, Konstantin, Hairong Kuang, Sanjay Radia, and Robert Chansler, "The
Hadoop distributed file system." In Mass storage systems and technologies (MSST),
2010 IEEE 26th symposium on, pp. 1-10. IEEE, (2010).
[5] Chujai, Pasapitch, Nittaya Kerdprasop, and Kittisak Kerdprasop, "Time series
analysis of household electric consumption with ARIMA and ARMA models", In
Proceedings of the International Multiconference of Engineers and Computer
Scientists, vol. 1, pp. 295-300. (2013).
[6] Ali, Usman, Concettina Buccella, and Carlo Cecati, "Households electricity
consumption analysis with data mining techniques," In Industrial Electronics Society,
IECON 2016-42nd Annual Conference of the IEEE, pp. 3966-3971. IEEE, (2016).
[7] Sauhats, Antans, Renata Varfolomejeva, Olegs Lmkevics, Romans Petrecenko, Maris
Kunickis, and Mans Balodis, "Analysis and prediction of electricity consumption
using smart meter data", In Power Engineering, Energy and Electrical Drives
(POWERENG), 2015 IEEE 5th International Conference on, pp. 17-22. IEEE, (2015).
[8] www.archive.ics.uci.edu.