
  • Självständigt arbete i informationsteknologi (Independent Project in Information Technology), 15 June 2020

    Fault Detection AI For Solar Panels

    Jonathan Kurén, Simon Leijon, Petter Sigfridsson, Hampus Widén

    Civilingenjörsprogrammet i informationsteknologi

    Master Programme in Computer and Information Engineering

  • Institutionen för informationsteknologi (Department of Information Technology)

    Visiting address: ITC, Polacksbacken, Lägerhyddsvägen 2

    Postal address: Box 337, 751 05 Uppsala

    Website: https://www.it.uu.se

    Abstract

    Fault Detection AI For Solar Panels

    Jonathan Kurén, Simon Leijon, Petter Sigfridsson, Hampus Widén

    The increased usage of solar panels worldwide highlights the importance of being able to detect faults in systems that use these panels. In this project, the historical power output (kWh) from solar panels combined with meteorological data was used to train a machine learning model to predict the expected power output of a given solar panel system. Using the expected power output, a comparison was made between the expected and the actual power output to analyze whether the system was exposed to a fault. The result was that, when applying the described method, an expected output could be created which closely resembled the actual output of a given solar panel system, with some over- and undershooting. Consequently, when simulating a fault (a 50% decrease of the power output), the system was able to detect all faults when analyzing over a two-week period. These results show that it is possible to model the predicted output of a solar panel system with a machine learning model (using meteorological data) and use it to evaluate whether the system is producing as much power as it should. Improvements can be made to the system, where adding additional meteorological data, increasing the precision of the meteorological data and training the machine learning model on more data are some of the options.

    Supervisors: Mats Daniels, Dilushi Piumwardane, Björn Victor and Tina Vrieler. Examiner: Björn Victor

  • Sammanfattning

    With an increasing use of solar panels around the world, the importance of being able to detect operational disturbances in the panels also grows. Using the historical power output (kWh) from solar panels together with meteorological data, machine learning models are used to predict the expected power output for a given solar panel system. The expected output is then compared with the actual output to detect whether a disturbance has occurred in the system. The result of applying this method is that an expected output which closely resembles the actual output can be modelled. Consequently, when a fault is simulated (a 50% decrease in power output), the system is able to find all introduced faults when analyzing over a two-week period. These results show that it is possible to model the expected output of a solar panel system with a machine learning model, and to use it to evaluate whether the system is producing as much power as it should. The system can be improved in several ways, where adding more meteorological parameters, increasing the precision of the meteorological data and training the machine learning model on more data are some of the options.

  • Contents

    1 Introduction
    2 Background
      2.1 An Overview of Solar Cells
      2.2 Factors Affecting Power Output
      2.3 STRÅNG - A Solar Irradiance Model
      2.4 Machine Learning Concepts
        2.4.1 Regression vs Classification
        2.4.2 Data Preprocessing
        2.4.3 Evaluation Metrics
        2.4.4 Model Validation
    3 Purpose, Aims, and Motivation
      3.1 Delimitations
    4 Related Work
      4.1 A Statistical Method to Find Faults
      4.2 Comparing Simulated Output with Measured Output
    5 Method
      5.1 Programming Language
      5.2 Machine Learning Model - Random Forest Regression
      5.3 Scoring Method and Validation Technique
      5.4 Meteorological Data
    6 System Structure
    7 Requirements and Evaluation Methods
      7.1 Regression Model Testing
      7.2 Fault Detection Testing
    8 Data Gathering
      8.1 Weather Data from SMHI
      8.2 Solar Irradiance Data from STRÅNG
      8.3 Data Format
    9 Data Preprocessing
      9.1 Removal of Data
      9.2 Constructing and Adding Data
    10 Fault Detection
      10.1 Expected Output
      10.2 Finding Faults
      10.3 Simulating Faults
    11 Evaluation results
      11.1 Regression Model Results
      11.2 Fault Detection Results
    12 Results and Discussion
      12.1 Regression Model
      12.2 Feature Importance
      12.3 Precision of the Meteorological Data
      12.4 Expected Output - Overshooting and Undershooting
      12.5 Fault Detection - Analysis
    13 Conclusions
    14 Future Work
    A Libraries
    B Explanatory Variables
    C Models and Scalers
    D Data Set
    E Decrease Tables


    1 Introduction

    The usage of solar panels, also referred to as photovoltaic (PV) modules, has seen an almost exponential increase globally during the last decade [19]. How a system of PV modules performs depends heavily on weather [9], dirt covering the panels [2] and a multitude of other factors. Some factors that decrease the energy production of PV modules are natural and cannot be prevented by an owner, such as the angle of the sun, clouds and other weather-related causes. There are other sorts of energy production decreases that an owner could actually stop from happening, for example a PV module breaking down, leaves covering the panels or other factors of a similar nature. The focus of this project was to develop a software-based fault detection system that can detect decreases of this kind in a photovoltaic system.

    To detect severe decreases, hereinafter referred to as faults, in a PV system, the historical energy production of the system and meteorological data were utilized. The meteorological data is gathered in close geographical and temporal proximity to the PV system. The energy production of a PV system over a given time period (e.g. an hour) will in this report be referred to as power output, measured in kWh. The power output and meteorological data were then used to train a machine learning model which predicts the expected power output for a given PV system. The fault detection system uses the expected, or predicted, power output together with the actual power output to detect whether the PV system is faulty or not. A fault detection system of this kind makes it easy to diagnose and detect a faulty PV system, seeing as it can be done remotely. This could lead to faster repairs or maintenance and more produced energy.

    The aim of creating a fault detection system that can detect faults in PV systems by using machine learning together with historical power output and meteorological data was achieved. However, the extent of what the system can detect turned out to be dependent on three parameters: the size of the power decrease, the threshold and the time horizon. By simulating a fault (a 50% power output decrease) it was found that the fault detection system can detect all of the faults when analyzing over a two-week period. However, simulating smaller faults, decreasing the time horizon or decreasing the threshold may negatively impact the result, which is elaborated on later in the report.

    2 Background

    This section will begin with a brief description of solar cells, how they work and what affects the power output. General concepts regarding machine learning that are of importance to this project will also be covered.


    2.1 An Overview of Solar Cells

    Solar cells, or PV cells, generate electricity when hit by light. A cell is made of semiconductors, often silicon, and the cell has conductors connected to its positive and negative sides, forming an electric circuit. When the cell is hit by light, electrons are released from the semiconductor material and create an electric current [15]. Multiple cells connected together form a PV module, and multiple modules connected together form a PV array. Figure 1 illustrates this.

    Figure 1 Visualization of a PV array and its components

    A module's efficiency is based on its direct current (DC) output under certain conditions called standard test conditions [12]. These are specific conditions concerning a module's temperature and the solar irradiance the module is exposed to. Under these conditions modern PV modules have an efficiency of around 15%, which means that they can convert 15% of the sunlight into electric energy [32].

    2.2 Factors Affecting Power Output

    There are many different factors that affect the output of a solar cell, and some major ones are described here. One example is clouds, which reduce the amount of sunlight that can reach the PV modules, although they do not remove all of it. On a cloudy day the power produced can be reduced by 75% [17].

    Higher temperatures reduce the power output of PV modules. Each module has a property called the temperature coefficient pMax. This coefficient is provided by the manufacturer and describes how much the maximum power will decrease for every 1°C the module temperature rises above 25°C. The reduction in power can range between 10-25% depending on the location of the PV system [6, 18]. For example, if pMax is -0.5 (a loss of 0.5% per °C) and the temperature of the module is 45°C, i.e. 20°C above 25°C, the reduction in power will be 20 × 0.5% = 10%.

    PV systems will degrade over time and normally the decrease in power will be 10% after 10-15 years and 20% after 20-25 years [22]. PV systems that were subjected to snow and wind had higher degradation rates than the ones that were not. Also, PV arrays that were mounted in the desert had high degradation rates [16].

    2.3 STRÅNG - A Solar Irradiance Model

    STRÅNG is a model that calculates different solar irradiance parameters over northern Europe [30]. It was created through a joint effort between SMHI, the Swedish Environmental Protection Agency (Naturvårdsverket) and the Swedish Radiation Safety Authority (Strålsäkerhetsmyndigheten). By using stations that can measure solar irradiance together with information about clouds, ozone and water vapor, STRÅNG can estimate the solar irradiance at a certain latitude and longitude combination. When using hourly model predictions the error can be up to 30% [31].

    2.4 Machine Learning Concepts

    Machine learning is the study of computer algorithms that improve automatically through experience [13]. In machine learning, the data can be preprocessed in an attempt to have the algorithm make better predictions or decisions. The algorithms used can be evaluated using different metrics, and also validated with respect to how well they behave on new data. The following sections describe machine learning concepts, such as those above, that are used in this report.

    2.4.1 Regression vs Classification

    Within machine learning two categories exist, supervised and unsupervised machine learning. In supervised learning the model is given input, called explanatory variables or features, and an output, called the response variable, whereas in unsupervised learning only explanatory variables are given [25, pp 393-394]. Supervised machine learning has two subcategories, regression and classification. Both of these share the same concept of trying to learn the mapping function between the explanatory variables and the response variable.

    Regression techniques predict a single, continuous value. An example could be to predict house prices based on explanatory variables such as location and size. Classification techniques, on the other hand, try to group the predicted value into a class, which means the predicted value is discrete. An example of a classification problem could be to classify an image of an animal into classes such as cat or dog.

    2.4.2 Data Preprocessing

    Preprocessing data is done to prepare raw data for further processing [23]. Real world data can be corrupt, inaccurate and sometimes data points can be missing. Missing data is a common occurrence and can threaten data quality. To deal with this there exist techniques such as imputing a value or simply excluding the entire record [24].

    A scaler is sometimes applied to the data before feeding it to the machine learning model. There are several reasons to use a scaler. One aspect is to balance (scale, normalize, standardize) the data, which leads to a more even representation of the explanatory variables [10]. If this is not performed, the explanatory variables with large values will heavily impact the machine learning model, even if the explanatory variable in reality has little impact. Furthermore, scaling the data leads to faster convergence when training the machine learning model.

    Furthermore, some data sets, more specifically time series data, can have seasonal or cyclical behavior. An example of this is the correlation between time and temperature: temperatures are higher at midday than at midnight. Representing this behavior in the data sets can further improve the later analysis. A method to represent this as an explanatory variable in a data set is to translate the linearity of time into the cyclical behavior of sine and cosine.

    Another technique that can be used when working with time series data and forecasting is the sliding window technique [3]. This technique is based on adding previous response variables to the next entry as explanatory variables. The number of previous response variables added as explanatory variables determines the window size.

    2.4.3 Evaluation Metrics

    Evaluating the accuracy of a machine learning model can be done in many different ways. Depending on whether it is a regression or classification problem, the methods will differ. For classification problems the F1 score is a common one. The F1 score is a measure between 0 and 1 and is calculated from the predictions a model made, more specifically the true positives, false positives and false negatives. The closer the score is to 1, the better the model is [35].

    For regression problems the R2 score and the root mean squared error (RMSE) are two often used evaluation metrics. The R2 score shows, typically on a scale between 0 and 1, how close the predicted output is to the real data, where 1 indicates that the model can explain 100% of the variance in the output. The lower the score, the less of the variance in the output the model can explain (a model that performs worse than simply predicting the mean can even get a negative score) [1]. RMSE gives, as the name says, the root of the mean of the squared errors between the predicted output and the real output. A low value indicates that the difference between the predicted and real output is small and that the model is good [21].
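    For reference, the standard definitions of these two metrics (the same definitions used by, for example, the sklearn library mentioned in Section 5.1) are, with y_i the measured outputs, ŷ_i the predicted outputs, ȳ the mean of the measured outputs and n the number of data points:

        R2 = 1 − Σ_i (y_i − ŷ_i)² / Σ_i (y_i − ȳ)²

        RMSE = √( (1/n) Σ_i (y_i − ŷ_i)² )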

    2.4.4 Model Validation

    When a machine learning model has been trained it is of importance to validate it on new data to avoid issues like overfitting and selection bias. Below are the validation techniques that were considered for this project.

    k-Fold Cross-validation

    One common validation technique in machine learning is cross-validation. One method of cross-validation is what is called k-Fold cross-validation. The process of the k-Fold method consists of the following steps [25, pp 32-33]:

    1. Shuffle the data set

    2. Split the data set into k smaller sets

    3. Train on k-1 sets

    4. Test the prediction accuracy on the leftover set for the evaluation score

    Then, the process can be repeated while holding out different test sets for the next iteration. Lastly, the model evaluation scores are summarized to represent the total score for the trained model.
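    As an illustration of the steps above, the following is a minimal sketch using scikit-learn's KFold; the choice of regressor here is only a placeholder and not the configuration used in the project.

        import numpy as np
        from sklearn.model_selection import KFold
        from sklearn.ensemble import RandomForestRegressor
        from sklearn.metrics import r2_score

        def kfold_r2(X, y, k=5):
            """Return the mean R2 score over k folds (steps 1-4 above)."""
            kf = KFold(n_splits=k, shuffle=True, random_state=0)      # steps 1 and 2
            scores = []
            for train_idx, test_idx in kf.split(X):
                model = RandomForestRegressor(random_state=0)
                model.fit(X[train_idx], y[train_idx])                  # step 3: train on k-1 sets
                scores.append(r2_score(y[test_idx],
                                       model.predict(X[test_idx])))    # step 4: score the held-out set
            return np.mean(scores)                                     # summarized total score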

    LOOCV - Leave One Out Cross-validation

    Another form of cross-validation is Leave One Out Cross-validation [4]. This is essentially an extreme case of k-Fold where k is chosen to be the number of data points in the data set. With the LOOCV method there is no grouping of data points into smaller sets like what is done with the k-Fold method. Instead, all data points except for one are used to train on, while only one data point is used to test the accuracy of the model. This has the advantage of using more data to train on.

    3 Purpose, Aims, and Motivation

    The aim of this project was to develop a system that can detect faults in PV systems. By using the historical power output from PV systems combined with meteorological data, the system should be able to model the expected output of a PV system. The faults are detected by comparing the expected power output with the actual power output of the PV system.

    Concretely, the goal of the system is to find as many real faults as possible while at the same time not stating that there are faults when there are none. The time it takes for these faults to be found should also be as short as possible.

    The idea that motivated this project came from our stakeholder HPSolarTech, an Uppsala-based company working in the photovoltaics industry. They wanted to explore the possibility of creating a software-oriented solution that can detect faults in PV systems using machine learning. If this project accomplishes this, it would be a general improvement for the solar industry, which could result in an increased usage of PV systems in our society.

    One of the reasons that this project is of importance is the fact that the installed PV systems across the world help to reduce CO2 emissions. Increasing the share of PV in the electricity mix will hence decrease the environmental impact of the countries that implement it [19]. If our project succeeds in making solar energy more appealing, it would therefore have an indirect, positive impact on the progress towards environmentally sustainable cities.

    3.1 Delimitations

    A delimitation concerning the machine learning part of the system is that the only available information about a PV system is its power output. This makes it hard to use one model for all PV systems, since the PV systems can differ in size and efficiency. Having one model for all PV systems would mean that the data for all PV systems could be used to train that one model. This would lead to the model having a lot more data to train on, rather than splitting the data amongst several models.

    An integral part of this project is to be able to detect when there is a fault in a PV system in the form of a technical malfunction. However, the data available from the PV systems does not have recorded information about when faults have actually occurred. Consequently, this has two effects on the project. Firstly, the model might be training on faults. For example, if one of the PV systems has had a fault for as long as data has been recorded, there is no way for the fault detection system to tell that there is a fault. Secondly, the faults the system is to make predictions on need to be simulated. While these simulations can be close to reality, they obviously do not exactly correspond to what an actual fault would look like.

    During November, December and January the panels have a high risk of being covered by snow. The system considers snow covering the panels a fault and should alert when a panel is covered by it. The consequence of this is that if the model is trained on data from the mentioned months, it will train on faulty data according to the definition of a fault. Therefore, these months were removed from the data sets, which significantly increased the performance of the model. The effect of doing this is that the system will not work during these months.

    Lastly, one of the main points of the system is to be able to detect malfunctioning solar panels with the only information extracted from a given PV system being its power output. By only using the power output it is not possible for the fault detection system to know if the PV system in question grows in size or capacity; extra information would be required to know this. This means that it is possible for a PV system to increase in capacity while the model is trained on data from when it had a lower capacity, which will make it hard for the model to accurately predict whether there is a fault.

    4 Related Work

    This section highlights some studies and projects related to this project in terms of detecting faults in PV systems. The referenced systems are similar in certain ways, such as what metric is measured from the PV modules, while they differ in how they analyze the gathered data.


    4.1 A Statistical Method to Find Faults

    Several studies have conducted analyses of various faults occurring in photovoltaic systems. One study [34] makes use of two different statistical methods in order to build a confidence interval for each subsystem's power output in the PV system. The underlying assumption, however, is that this purely statistical analysis can only be performed if the PV system can be viewed as a system divided into several equivalent independent subsystems. In practice, this means that the power output must be readable for each subsystem (which might not be the case). This study is relevant here since, although it does not make use of meteorological data, the only physical quantity measured at the PV system which is analyzed is the power output. The difference between their approach and the approach used for this project lies in what data is used. For this project's fault detection system there is less granularity in the information from the PV systems. This is, on the other hand, compensated for with additional meteorological data. How the data is processed to detect failures also differs a lot, since the mentioned study uses purely statistical methods to determine if a failure has occurred while this project uses machine learning models to predict failures.

    4.2 Comparing Simulated Output with Measured Output

    Another study that is interesting in relation to this project is the joint study by the European Commission which was a vital part of the PVSAT-2 project [26]. This study compared the predicted (simulated) power to the measured power from the PV system, taking meteorological data into consideration. The comparisons are done for different time spans, comparing the current day, the past 7 days as well as the past 30 days. Each of these time spans is taken into consideration when answering whether or not the measured output is in line with the simulated output. In the case of a significant difference in the output, a profile is created describing the failure of the PV system. By then comparing the created profile with predefined ones for the different failures, the probability of each one of them is calculated. Similar to this project, meteorological data is directly used in order to simulate/predict the current power output from the PV system, and power is the only measurement. The key difference is that the predicted power is simulated purely from a mathematical model, whilst in this project it is determined by using AI and machine learning.

    Furthermore, other systems have been developed to automatically detect faults in PV systems by simulating the PV system with a mathematical model. Another example of this is the procedure for automatic fault detection presented by Silvestre, Chouder and Karatepe where they, akin to the PVSAT-2 study, take the output of the simulated PV system and compare it to the actual output of the PV system to determine whether a fault has occurred or not [27]. This is once again related to our method, with the difference being that the expected output from our PV system is not generated by a simulated PV system. Instead, our system generates an expected output based on historical data by using machine learning, as described in Section 5.

    5 Method

    The following section describes the different methods and techniques used in this project. It explains what programming language was used, from where the data was gathered, what machine learning model was used and why it was the best fit for this project.

    5.1 Programming Language

    The code base for this project is written in the programming language Python. According to Thomas Elliot at GitHub, Python was one of the most popular languages used for machine learning in 2018 [5]. Libraries generally used in connection to machine learning include numpy for matrix operations, pandas for applying the matrix operations to data sets and sklearn for applying machine learning algorithms to the data. These libraries are used in the machine learning part of this project.

    The main contender to Python for machine learning is R. Since this project aims to create a system that is usable in the solar panel industry and not in isolation, it is important that it is written in a programming language that is easy to combine with other languages. Because Python is a general purpose language, it is easier to combine with other languages than R, while it also simplifies tasks such as accessing APIs [8]. Furthermore, the aforementioned libraries for Python make it easier to manipulate data in Python than in R, which is important for the handling of the data from different sources that is processed in the system.

    Considering the aspects discussed above, the choice of Python as the programming language for the entire system was straightforward. It simplifies integration and data processing while also having the required support for the machine learning operations that are needed.


    5.2 Machine Learning Model - Random Forest Regression

    In order to determine which machine learning model would be used to predict the expected output, a multitude of models were tried, the results of which can be found in Table 1. Random Forest Regression, or RFR [14], came out as the best performing one. RFR is a more advanced version of regression trees. A regression tree is a decision tree where the output is continuous (i.e. a real number) instead of discrete (true, false or choice A, B, C etc.). Several regression trees are created, each trained on a resampled version of the training data. Together these regression trees make up the "forest". Finally, the output of the RFR is calculated as the mean value of all trees in the forest.

    As can be seen in Table 1, using RFR with no scaler gave the best R2-score. As mentioned in Section 2.4.2 a scaler can sometimes be used on the data set, mostly to avoid the issue of having explanatory variables of different magnitudes. In RFR this is not a problem, since the partitioning in the regression trees will be the same even if the data is scaled, as long as the order is preserved [11].

    5.3 Scoring Method and Validation Technique

    The scoring method chosen to evaluate the machine learning models in this project was the R2-score. In the choice between R2 and RMSE, the R2-score was favored since it is a relative measurement of how well the model fits the data. In order to get a feel for how big an error is with RMSE, the RMSE needs to be compared to the size of the values in the data. Using R2 (which typically lies between 0 and 1), the meaning of the score is always intuitive and easy to understand [33]. In practice, when it comes to finding the best machine learning model, either metric could however be used, because searching for the lowest RMSE and the highest R2-score on the same test set would yield the same result.

    When determining what validation technique to use, the main concern, apart from actually being able to validate the model, was computational intensity. Since k-Fold repeats the process described in Section 2.4.4 k times, and LOOCV is essentially k-Fold with as large a k as possible, LOOCV is naturally more computationally intensive. Repeating this k-Fold process for every data point in the data sets, where most of them span multiple years with data recorded every hour, would be too time-consuming.


    5.4 Meteorological Data

    Weather data is needed to predict the output of a PV system. Since there is no weather data included in the data from the PV systems, it had to be retrieved from somewhere else. The only service found that provides historical weather data in Sweden is the Swedish Meteorological and Hydrological Institute's (SMHI) open data. By using SMHI open data it is possible to get the meteorological data for a collection of weather stations in Sweden. This data contains parameters such as air temperature, humidity, solar irradiance, air pressure and cloud base [28]. How often each station saves its data differs from station to station, between hourly and once each day, depending on how advanced the station is. It is possible to retrieve either the last hour, the last day, the last four months or historical data that has been quality controlled. Some parameters such as air temperature have over 1000 stations located in different parts of Sweden, while solar irradiance only has about 20 [29]. To compensate for this, STRÅNG, a model which calculates solar irradiance at a specific coordinate in northern Europe (see Section 2.3), is used for the solar irradiance parameter.

    6 System Structure

    The system consists of three modules: data gathering, data preprocessing and fault detection, see Figure 2. Data gathering is where calls to different APIs are made and the resulting data is saved. The data from the API calls is then integrated and processed in the data preprocessing module. The fault detection module is where a decision is made whether the PV system is faulty or not.


    Figure 2 Overview of the system modules and the system's process of finding faults.

    The first module of the system is the Data Gathering module. This is where the system makes different API calls to retrieve meteorological data and combines it with PV system data for a given PV system. After the data is gathered and compiled, each PV system has its own file containing the PV system data and the corresponding meteorological data for each data point. The compiled file is then forwarded to the Data Preprocessing module.

    The Data Preprocessing module is responsible for two key areas. Firstly, cleaning the data retrieved by the Data Gathering module; cleaning data is the process of detecting and correcting corrupt or incorrect data. Secondly, constructing and adding more features to the data set, where constructing features refers to the process of converting existing features into another form. After these steps the data can be used in the Fault Detection module.

    The last module of the system is the Fault Detection module. This module involves two components: one component predicts the expected output for a PV system using the machine learning model, and the other component uses the expected output to detect whether a PV system is faulty or not.


    7 Requirements and Evaluation Methods

    This section presents the evaluation methods for finding what regression model performed the best and how the fault detection system will be tested.

    7.1 Regression Model Testing

    To find which machine learning model is best suited for predicting the output of a PV system, a comparison between a number of different regression models was done. Shay Geller [7] published an article on Towards Data Science where he compared the results of using different machine learning models on a classification problem. For each model, Geller also compared the results for a variety of different scalers. In the end it was possible to see which combination of scaler and model had the highest F1-score for the problem. By switching all classification models to regression models and changing the F1-score to the R2-score, Geller's code was used to find the best scaler-model combination. The models and scalers that were used for the comparison are listed in Appendix C.

    The R2-score that is calculated for a model-scaler combination shows how good the predictions are for a single PV system. As mentioned in Section 2.4.3 the R2-score indicates how well the model can explain the variance of the output; a higher score is better. The results from one PV system might not be representative of every PV system. Therefore the test is run over multiple PV systems and the mean of the R2-scores for all PV systems is used. The model-scaler combination with the highest mean R2-score is the one that will be used. To evaluate how well the model can represent the expected output, the results in the fault detection part are used as a measurement. If the model is able to find faults, while still upholding the requirements described in Section 7.2, the model is said to be good enough.
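    The following is a minimal sketch of such a model-scaler comparison loop; the model and scaler lists shown here are only a subset for illustration (the full set is listed in Appendix C), and the variable datasets is assumed to be a list of (X, y) arrays, one per PV system.

        import numpy as np
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
        from sklearn.ensemble import RandomForestRegressor
        from sklearn.neighbors import KNeighborsRegressor
        from sklearn.model_selection import train_test_split
        from sklearn.metrics import r2_score

        scalers = {"No Scaler": None, "MinMaxScaler": MinMaxScaler(),
                   "RobustScaler": RobustScaler(), "StandardScaler": StandardScaler()}
        models = {"KNN": KNeighborsRegressor, "RFR": RandomForestRegressor}

        def mean_r2(datasets, make_model, scaler):
            """Mean R2 over all PV-system data sets for one model-scaler combination."""
            scores = []
            for X, y in datasets:                      # one (X, y) pair per PV system
                X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, shuffle=False)
                steps = ([scaler] if scaler is not None else []) + [make_model()]
                pipe = make_pipeline(*steps)
                pipe.fit(X_tr, y_tr)
                scores.append(r2_score(y_te, pipe.predict(X_te)))
            return np.mean(scores)

        # results = {(m, s): mean_r2(datasets, models[m], scalers[s])
        #            for m in models for s in scalers}   # the highest mean R2 wins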

    7.2 Fault Detection Testing

    To see how well the fault detection system can find faults, three tests will be conducted. In the first test the simulated fault will be a 40% decrease, in the second a 50% decrease and in the third a 60% decrease. For each test there will be 42 PV systems that will each have the specified fault simulated in the entire test set. This means that there are 42 possible faults to find. Furthermore, for each test there are two variables that can be changed to find the most optimal fault detector: the threshold and the time horizon. The tests will be done using thresholds between 20% and 75% and time horizons between 3 and 21 days. More information about the threshold, time horizon and simulated fault can be found in Section 10.2.

    Each test will also be run twice, where the second run will be without any simulated fault. If the system finds a fault in a test without any simulated faults, this is marked as a false positive. If instead no fault was found in the test without simulated faults, but a fault was found in the test with simulated faults, this is marked as a true positive. If no fault was found in either the test without simulated faults or the test with simulated faults, this is marked as a false negative. More concretely, a false positive means that the system found a non-existing fault, a false negative means that the system did not find an existing fault, and a true positive means that the system found a real fault.

    In addition to actually finding the fault, the time it takes to find the fault is also an important aspect. To measure this, the average time it takes for the system to find a fault, when run over all PV systems, is used. Note that the time it takes for the system to find a fault is not actual run time but how many data points need to be used to find a fault.

    The following requirements have been chosen to evaluate for which values of the variables the results are acceptable:

    1. No false positives.

    2. The average time it takes for the system to find the fault needs to be the same as the time horizon.

    3. No false negatives.

    The first requirement was chosen because it is not acceptable for the system to say that there is a fault when there is none, since that might lead to an attempted repair of a non-faulty panel. The second requirement was chosen because, when using a certain time horizon, it is expected that the system finds the faults within that time. If the average time exceeds the time horizon, it means that some faults went unnoticed for some time, which is something to be avoided. The third requirement was chosen because the system should be able to find all the faults.

    8 Data Gathering

    An overview of the data gathering process is as follows: for each PV system, retrieve weather data from SMHI as well as solar irradiance data from STRÅNG, and combine the data into a single CSV file for the PV system. Once this is complete for all PV systems, the database containing power output is accessed, and for each PV system its corresponding values are read and added to the existing CSV file.

    8.1 Weather Data from SMHI

    By using the SMHI Open Data API for meteorological observations, historical data can be gathered from a weather station. The API is used by requesting a specific parameter (e.g. air temperature) from a specific weather station. The immediate problem is how to decide which station to gather the parameter data from. The method used is to calculate the distance from the PV system to the different stations. For a majority of the PV systems analyzed, either the exact coordinates are known or at the very least the zip code, which can then be used in conjunction with geocoding packages such as pgeocode to find approximate coordinates.

    By using the PV system's coordinates and comparing them to the coordinates of each weather station, the nearest station can be found. However, different weather stations measure different meteorological parameters and, additionally, different stations cover different time spans. Taking this into consideration, a new station is chosen for each individual parameter. The chosen station is not always the closest station, but rather the closest station which measures the sought parameter and has the correct time span. There is, however, one parameter which is not gathered by the SMHI API calls, namely solar irradiance, and that is where STRÅNG is used.
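    A minimal sketch of the nearest-station selection described above; the representation of each station as a dictionary with coordinates and a set of measured parameters is an assumption, and the great-circle (haversine) distance is used here as one reasonable distance calculation.

        from math import radians, sin, cos, asin, sqrt

        def haversine_km(lat1, lon1, lat2, lon2):
            """Great-circle distance between two (lat, lon) points in kilometres."""
            lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
            a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
            return 2 * 6371 * asin(sqrt(a))

        def nearest_station(pv_lat, pv_lon, stations, parameter):
            """Closest station that measures the sought parameter.

            stations: list of dicts like {"id": ..., "lat": ..., "lon": ..., "parameters": {...}}
            """
            candidates = [s for s in stations if parameter in s["parameters"]]
            return min(candidates, key=lambda s: haversine_km(pv_lat, pv_lon, s["lat"], s["lon"]))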

    8.2 Solar Irradiance Data from STRÅNG

    Accessing STRÅNG's data can be done with an API call containing the start date, end date, coordinates and parameter. The parameter in the STRÅNG API that is used is called CIE UV irradiance. The start and end dates are dynamically chosen based on the time span in the earlier SMHI call. The coordinates can be read directly off the PV system and used in the API call, as opposed to the SMHI API call where a station had to be located.

    8.3 Data Format

    When all the API calls are done, the data needs to be compiled into a CSV file. There will be one row of data for every hour in the time span used in the data gathering. For each row there will be columns containing the date (year/month/day/hour) represented as a Unix timestamp [20], the meteorological data (explanatory variables) and the power output (response variable).

    9 Data Preprocessing

    The steps for preprocessing the CSV files are as follows: read the CSV file, remove rows with missing values, remove unnecessary explanatory variables, and construct and add some explanatory variables. An example of a CSV file used can be found in Appendix D. The following subsections discuss the steps of removing data and the way new explanatory variables are constructed and added.

    9.1 Removal of Data

    As mentioned in Section 2.4.2, real world data can often be corrupt or have missing values. In the CSV files there are no corrupt values, but there might be a few missing values. This problem could come from the APIs, SMHI and STRÅNG, having some missing data points. To resolve this problem, the rows with missing values are simply removed. Apart from this, the column with the date, in the format of Unix time, also needs to be removed and not used as an explanatory variable. This column represents the date as a continuous value which increases in a linear fashion, which does not correlate with the cyclical behavior of hours in a day and months in a year. For example, during certain parts of the day and certain months there are more sun hours.
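    A minimal pandas sketch of this removal step; the file name and the name of the timestamp column ("unix_time") are only illustrative, since the actual column names are those shown in Appendix D.

        import pandas as pd

        df = pd.read_csv("pv_system_1.csv")   # one row per hour, see Section 8.3
        df = df.dropna()                       # drop rows with missing values
        # The Unix-time column is dropped as well; in the sketch for Section 9.2 below
        # this is done after hour and month have been derived from it.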

    9.2 Constructing and Adding Data

    In the previous section an example of the cyclical behavior of hours and months was mentioned. Because these factors play a big role in solar production they cannot be fully disregarded when analysing the data set. To add these factors back into the data sets, but with the cyclical and seasonal behavior, sine and cosine are used. Firstly, two columns for hours are generated by converting each hourly time to numerical values given by sine and cosine: sin(hour · 2π / 24) and cos(hour · 2π / 24), where hour is the specific hour to be converted (a number between 0 and 23). Similarly, this is done for months: sin(month · 2π / 12) and cos(month · 2π / 12), where month is the month represented as a value between 0 and 11.


    Another column is then added to the data set; this column has values based on a technique called sliding window. As mentioned in Section 2.4.2, this works by adding the previous response variable as an explanatory variable on the next row. This becomes problematic for the first row, since this row has no previous value. Because the value is missing, the row is deleted, similar to how missing values are handled in Section 9.1. The new first row uses the output of the removed row as its sliding window value.
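    A minimal pandas sketch of the constructions described in this section, continuing from the data frame in the Section 9.1 sketch; the column names "unix_time" and "power_output" are illustrative.

        import numpy as np
        import pandas as pd

        # Hour (0-23) and month (0-11), derived here from the timestamp before it is removed
        ts = pd.to_datetime(df["unix_time"], unit="s")
        df["sin_hour"] = np.sin(ts.dt.hour * 2 * np.pi / 24)
        df["cos_hour"] = np.cos(ts.dt.hour * 2 * np.pi / 24)
        df["sin_month"] = np.sin((ts.dt.month - 1) * 2 * np.pi / 12)
        df["cos_month"] = np.cos((ts.dt.month - 1) * 2 * np.pi / 12)
        df = df.drop(columns=["unix_time"])        # Section 9.1: remove the linear time column

        # Sliding window of size one: the previous hour's output as an explanatory variable
        df["previous_output"] = df["power_output"].shift(1)
        df = df.dropna()                            # the first row has no previous value and is removed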

    10 Fault Detection

    The steps in the fault detection are as follows. An expected output is created using a machine learning model. Then faults are simulated in the real output. The system then tries to find these faults by comparing the expected output with the faulty output.

    10.1 Expected Output

    The expected output is calculated by using a Random Forest Regression model. The number of trees in the random forest was set to 100. During testing with a lower number of trees, significant drops in the R2 score were observed. Increasing it to more than 100 trees did not yield any observable improvement in terms of R2 score, while negatively affecting runtime.

    When creating the expected output for a PV system, 80% of the data set is used as a training set and the remaining 20% is used as the test set. These sizes were chosen since having a big training set leads to a better trained model, which gives better results. However, when increasing the size of the training set to 90%, the test set of 10% became very small for some PV systems which did not have a lot of data, and a small test set might not represent the data accurately. Splitting the training and test sets 80-20 was a good balance between having as big a training set as possible whilst still having a good representation of the total data in the test set. The random forest model then trains on the training set and is used to make predictions on the test set. The predictions made on the test set will be the expected output of the PV system and are used in the part of the system that finds the faults.
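    A minimal sketch of this step with scikit-learn, continuing from the preprocessed data frame in Section 9; the feature column list is an illustrative subset (the full set of explanatory variables is given in Appendix B).

        from sklearn.ensemble import RandomForestRegressor

        features = ["sun_hours", "previous_output", "air_humidity", "air_pressure",
                    "sin_hour", "cos_hour", "sin_month", "cos_month"]   # illustrative subset
        X, y = df[features].to_numpy(), df["power_output"].to_numpy()

        split = int(len(df) * 0.8)                        # 80/20 split, in chronological order here
        X_train, X_test = X[:split], X[split:]
        y_train, y_test = y[:split], y[split:]

        model = RandomForestRegressor(n_estimators=100)   # 100 trees, as stated above
        model.fit(X_train, y_train)
        expected_output = model.predict(X_test)           # compared with the measured output in Section 10.2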


    10.2 Finding Faults

    Fault detection is performed by comparing the expected output with the actual output of the PV system. The difference between the expected output and the measured output, relative to the expected output, gives the relative fault in terms of percentage. The relative fault is used in a comparison with a threshold of, for example, 50%.

    The fault detection does not happen momentarily, i.e. for one data point. Instead, all data points over a specified length of time, called the time horizon, are considered. Figure 3 shows a seven day time horizon over the predicted output and a faulty output.

    Figure 3 A 7 day time horizon visually represented on a plot showing the predicted output vs a faulty output with a 50% decrease

    For all data points in the time horizon the error, i.e. the average difference relative to the average expected output, is considered:

        Error = avg(expected − measured) / avg(expected)

    This Error is what is compared with the threshold; if it exceeds the threshold the system will report a fault. If no fault is found, the time horizon moves forward one hour and the error is calculated again. In this way the time horizon moves through the entire data set, containing as many data points as the size of the time horizon at any given time. The system will report no faults if the entire data set is traversed by the time horizon and no errors exceeded the threshold. The correlation between the threshold and the time horizon is typically that smaller time horizons lead to more variance, which results in needing a higher threshold in order to not report false positives. Similarly, a larger time horizon allows a smaller threshold, which helps avoid false negatives (i.e. missing actual faults).
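    A minimal sketch of this sliding-horizon check, under the assumption that expected and measured are hourly arrays of equal length; threshold is a fraction (e.g. 0.5 for 50%) and horizon_hours is the time horizon expressed in hours.

        import numpy as np

        def find_fault(expected, measured, threshold, horizon_hours):
            """Return the index (hour) where the offending window starts, or None."""
            for start in range(0, len(expected) - horizon_hours + 1):   # move one hour at a time
                exp = expected[start:start + horizon_hours]
                mea = measured[start:start + horizon_hours]
                error = (np.mean(exp) - np.mean(mea)) / np.mean(exp)    # the Error formula above
                if error > threshold:
                    return start                                        # fault reported for this window
                # otherwise slide the time horizon forward one hour
            return None                                                 # no error exceeded the threshold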

    10.3 Simulating Faults

    As described in Section 3.1, one delimitation of this project was that, to test the system, faults had to be simulated in the data sets. To simulate the faults, the test set containing the power output data is decreased by a certain percentage during chosen intervals. The function that handles the decreasing of the data takes a list of (interval, percentage)-tuples as input arguments, where the percentage decrease is applied over the given intervals. Depending on the length of these intervals and the magnitude of the percentage decrease, the difficulty of detecting a fault will vary.
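    A minimal sketch of such a simulation function; the exact signature used in the project is not given in the report, so representing each interval as a (start, end) index pair is an assumption.

        import numpy as np

        def simulate_faults(measured, faults):
            """Decrease the measured output over the given intervals.

            faults: list of ((start, end), percentage) tuples, e.g. [((0, 336), 50)]
                    for a 50% decrease over two weeks of hourly data.
            """
            faulty = np.array(measured, dtype=float)
            for (start, end), percentage in faults:
                faulty[start:end] *= 1 - percentage / 100   # e.g. 50% -> multiply by 0.5
            return faulty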

    Figure 4 Example of how the system predicts vs the real output with a 50% simulated fault

    Figure 4 showcases the predicted output versus the real output with a simulated fault over a two-week period. The difference between the two plots is what is utilized in the fault detection part of the system.


    11 Evaluation results

    This section shows the results of which model-scaler combination performed the best, together with example plots of what the predictions look like visually. The section also shows the results of how well the fault detection system can find simulated faults.

    11.1 Regression Model Results

    Table 1 lists the R2-scores of each model-scaler combination. For each model, every scaler is tried in combination to see which combination performs the best. As can be seen, random forest regression outperforms every other model, with the highest R2-score obtained when no scaler is used.

    Scaler            DTR       KNN       LASSO     LR        MLP       RFR
    No Scaler         0.8634    0.8139    0.8629    0.8704    0         0.9218
    MaxAbsScaler      0.8708    0.9001    0.8095    0.8546    0.8843    0.9200
    MinMaxScaler      0.8710    0.9043    0.8064    0.8547    0.9105    0.9200
    Normalizer       -0.0015    0.8642   -0.0015    0.8527   -0.0023   -0.0015
    PT Yeo-Johnson    0.8714    0.8935    0.7431    0.7511    0.9154    0.9216
    QT-Normal         0.8710    0.8994    0.5948    0.5994    0.9145    0.9200
    QT-Uniform        0.8705    0.8980    0.6081    0.6557    0.9196    0.9200
    RobustScaler      0.8709    0.9081    0.8524    0.8547    0.9105    0.9200
    StandardScaler    0.8709    0.9076    0.8463    0.8547    0.9095    0.9199

    Table 1 Model-scaler results (R2-scores). The highest score, 0.9218, is obtained by RFR with no scaler.

    Figure 5 and Figure 6 show the predicted output versus the real output of a PV system during two different 14 day periods. The predicted output is generated by the machine learning model chosen after evaluating the model-scaler combination R2-scores, i.e. random forest regression. It is observable that the predicted output in Figure 5 more closely resembles the real output than in Figure 6, which is explained in Section 12.4.


    Figure 5 Example of how well the system predicts the output during a two-week period

    Figure 6 Example of how the system predicts the output during a two-week period


    11.2 Fault Detection Results

    This subsection presents the results of the fault detection system when using different time horizons, represented as true positives (correctly found faults), false positives (a fault "detected" when none existed), false negatives (missed faults) and the average time, in days, until the fault was found, averaged over all PV systems for that threshold. All tables show the results from the test where the simulated fault was a decrease of 50%. The results from the other tests, with 40% and 60% decreases, can be found in Appendix E.

    Table 2 shows that, with a time horizon of 21 days, it is possible to detect all faults with a threshold ranging from 35% to 45%. Decreasing the threshold further makes the system detect non-existing faults, as indicated by the one false positive at a threshold of 30%. Conversely, increasing the threshold above 45% makes the system miss actual faults, which is demonstrated by the five false negatives at a threshold of 50%.

    Threshold %   True Positive   False Positive   False Negative   Average days until error found
    20            37              5                0                21.000
    25            39              3                0                21.000
    30            41              1                0                21.000
    35            42              0                0                21.000
    40            42              0                0                21.000
    45            42              0                0                21.275
    50            37              0                5                34.597
    55            19              0                23               64.540
    60            3               0                39               69.764

    Table 2 50% decrease with a time horizon of 21 days.

    Table 3 shows that with a time horizon of 14 days it is possible for the system to detect every fault with a threshold of 45%. Setting the threshold higher than 45% has the same effect as in the case of the 21 day time horizon, where the system begins to miss actual faults. Further, it is noteworthy that setting the threshold lower than 45%, e.g. to 40% or 35%, makes the system report false positives where it did not with the 21 day time horizon.


    Threshold %   True Positive   False Positive   False Negative   Average days until error found
    35            40              2                0                14.000
    40            41              1                0                14.000
    45            42              0                0                14.306
    50            38              0                4                27.503
    55            25              0                17               59.977
    60            11              0                31               64.515

    Table 3 50% decrease with a time horizon of 14 days.

    The most notable result of Table 4 is that it is no longer possible for the system to detect all faults with any given threshold. This is the result of having a small time horizon set to 7 days.

    Threshold %   True Positive   False Positive   False Negative   Average days until error found
    35            36              6                0                7.000
    40            37              5                0                7.027
    45            41              1                0                7.316
    50            41              0                1                11.244
    55            31              0                11               48.952
    60            20              0                22               56.808

    Table 4 50% decrease with a time horizon of 7 days.

    Table 5 illustrates an extreme case where the time horizon is very small. Noticeably, the number of non-existing faults reported as faults (false positives) is large compared to the other time horizons. Increasing the threshold can compensate for this, as can be seen for thresholds above 60%. However, for these greater thresholds, the number of unreported actual faults (false negatives) increases.


    Threshold %   True Positive   False Positive   False Negative   Average days until error found
    35            15              27               0                3.000
    40            15              27               0                3.014
    45            20              22               0                3.321
    50            28              14               0                4.394
    55            29              9                4                25.616
    60            24              4                14               45.889
    65            24              2                16               54.703
    70            26              0                16               61.191
    75            13              0                29               56.224

    Table 5 50% decrease with a time horizon of 3 days.

    12 Results and Discussion

    In this section the results of the fault detection and the regression model, along with its feature importance, will be discussed.

    12.1 Regression Model

    The R2-scores shown in Table 1 indicate that Random Forest Regression performs the best on the given data. Furthermore, the best R2 score achieved with RFR is when no scaler is used, but the results for all scalers are very similar. This is to be expected since scaling the data has close to no impact on this type of model.

    12.2 Feature Importance

    Feature importance is a quantitative measurement of how important the feature/explanatory variable is in the ML model. A higher feature importance means that the explanatory variable contributes more to the decisions in the model. For example, if the feature importance is zero, the explanatory variable plays no part in the decision making. The following table displays the feature importance for the RFR model.


    Explanatory Variable   Feature Importance
    Sun Hours              0.80198
    Previous Output        0.14447
    Air Humidity           0.01184
    Air Pressure           0.00862
    Cloud Coverage         0.00735
    Air Temperature        0.00725
    sin(hour)              0.00668
    sin(month)             0.00418
    cos(hour)              0.00348
    Precipitation          0.00246
    cos(month)             0.00171

    Table 6 Feature importance of each explanatory variable in descending order.

    Unsurprisingly, Sun Hours was the most dominant explanatory variable. Seeing as the power output drops to zero when there is no sun present, this is rather natural. Something that was not in line with our initial thinking, however, was the (un)importance of Air Temperature, which we originally thought would be one of the top ranking explanatory variables. Another thing that was rather unexpected was that Previous Output, which is a parameter that was created/constructed rather than directly read off a weather station or similar, had such a major influence on the model. What this means is that the current value actually is rather dependent on what the previous value was. This concretely shows the benefit of constructing and adding additional explanatory variables to the machine learning model.

    Since the explanatory variables have a one-to-one mapping to the input parameters, the RFR model is limited by which parameters are used. This means that it is possible that other meteorological parameters that we did not take into account could have had a more explanatory effect on the power output. An easy way to test if that is the case is to fetch more of the available parameters from SMHI, like thunder probability or wind direction, and train the model with the additional parameters.
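    As an aside, importances such as those in Table 6 can be read directly from a fitted scikit-learn model; a minimal sketch, assuming the model and features variables from the sketch in Section 10.1:

        importances = sorted(zip(features, model.feature_importances_),
                             key=lambda pair: pair[1], reverse=True)
        for name, importance in importances:
            print(f"{name:20s} {importance:.5f}")   # descending order, as in Table 6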

    12.3 Precision of the Meteorological Data

    It is of importance that the fault detection system is able to retrieve as precise meteorological data as possible, so that it can represent the weather surrounding a given PV system. However, when retrieving meteorological data for a given PV system, the fault detection system finds the closest SMHI weather station to the (lat, long)-coordinates of the given PV system, as described in Section 8.1. After finding the closest SMHI weather station it then retrieves the data. In some cases this means that the weather station that is the closest might actually be far away from the PV system, which results in inaccurate data.

    Furthermore, when retrieving data for solar irradiance using STRÅNG, there is an error margin of up to 30% that cannot be affected. Additionally, the meteorological data gathered at a specific date and time is the meteorological state at the given position at that exact time. This means that it does not exactly represent the weather surrounding the station for the entire hour over which power output data has been collected. Consequently, if the weather is unstable during that hour, the meteorological data gathered will not have a precise correlation to the power output of the PV system. The result of these factors is that the machine learning model may train on data that does not represent reality accurately.

    12.4 Expected Output - Overshooting and Undershooting

    Figure 5 and Figure 6 illustrate how the system predicts the output of a given PV system during a two-week period between August-September and in October, respectively. In this context, overshooting means that the system believes that the PV system should be producing more power than it is, which can lead to false positives. Conversely, undershooting means that the system believes that the PV system should be producing less power than it is, which can lead to false negatives. It is observable that for the first period the model has less under- and overshooting. This is most likely due to the power output being more stable and hence more predictable. The second period shows that the system sometimes overshoots with its predictions, e.g. between 2019-10-11 and 2019-10-13, while also undershooting at around 2019-10-21. This means that, based on the figures presented, we can expect the system to report some erroneous information. A reason why the expected output is overshooting and undershooting might be the error margin in the STRÅNG data, which as mentioned in Section 2.3 can be up to 30%. An obvious improvement would therefore be to have more accurate sun data. One way to achieve this would be to actually measure the irradiance at the location of the PV system. This would however require extra measurement equipment and is not within the scope of this project, where a software based solution was explored.


    12.5 Fault Detection - Analysis

The results presented in Section 11.2 show that it is in fact possible for the system to fulfill the requirements described in Section 7.2. When analyzing over a time horizon of 21 days (Table 7) it is possible to achieve zero false positives and have the average number of days until an error is found equal the time horizon. It can also be seen that with a larger time horizon it is possible to use a lower threshold and still keep the false positives at or close to zero. To further increase the likelihood of keeping false positives at zero, we would need to set the threshold relatively high, which would increase the false negatives and the average days until an error is found. With a time horizon of 14 days, the goal of zero false positives is achieved for one threshold, but the average days is slightly higher than the time horizon. Even though this does not fulfill the second requirement, it might be good enough in practice, since the average days is still very close to the time horizon. A time horizon of 3 or 7 days makes it difficult for the system to detect more than 50% of the total errors while achieving zero false positives, as can be seen in Table 9 and Table 10. Along with the low detection percentage (true positives versus false negatives), the average days until an error is found grows very large for both the 3- and 7-day horizons. Short time horizons therefore do not fulfill the requirements.
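One plausible way to read the threshold and time-horizon trade-off discussed here is sketched below: the detector sums expected and actual output over the chosen horizon and reports a fault when the actual total falls short of the expected total by more than the threshold percentage. This is an illustration of the trade-off with made-up daily totals, not necessarily the exact rule used by the system.

```python
def fault_detected(expected_kwh, actual_kwh, threshold_pct):
    """Flag a fault if actual production over the horizon falls more than
    threshold_pct percent below the expected production."""
    exp_total = sum(expected_kwh)
    act_total = sum(actual_kwh)
    if exp_total == 0:
        return False
    shortfall_pct = 100 * (exp_total - act_total) / exp_total
    return shortfall_pct > threshold_pct

# Hypothetical daily totals over a 14-day horizon with a 50% drop from day 8 on.
expected = [10.0] * 14
actual = [10.0] * 7 + [5.0] * 7
print(fault_detected(expected, actual, threshold_pct=20))  # True
print(fault_detected(expected, actual, threshold_pct=40))  # False
```

With a lower threshold the fault is caught even when only part of the horizon is affected, at the cost of triggering more easily on ordinary prediction errors, which is exactly the false-positive trade-off described above.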

These results are based only on the sample of 42 PV systems in which we simulated an error. This means that the results, where all the faults can be found with no false positives, cannot be guaranteed in practice, since there might be other PV systems with some factor affecting the output that our model cannot explain.

While we also simulated faults of 40% and 60%, those results do not supply more information than the 50% results discussed in this section. Unsurprisingly, they show that a larger fault makes it easier for the system to detect faults, while a smaller fault makes detection harder. This can be seen in Appendix E. Furthermore, we did not have any information on how large the power decreases caused by real faults are in practice. It is therefore possible that the percentages chosen in the testing are too large, which would mean the system is not as good as our testing indicates. If real faults are smaller, perhaps a 20% decrease, the system would need a lower threshold. This comes with a higher risk of false positives, and to compensate the time horizon would need to be longer, which increases the time it takes to find faults.


    13 Conclusions

The aim of the project was to create a system that can detect faults in PV systems with the use of machine learning and meteorological data. This was achieved by creating an expected output for a PV system and comparing it to the actual output. What the system can detect turned out to depend on three parameters: the size of the power decrease, the threshold, and the time horizon. The results in this report show that, given a certain decrease, the threshold and time horizon can be tweaked to yield the desired results of zero false positives, an average number of days until detection equal to the time horizon, and a good detection percentage (true positives compared to false negatives). In summary, it is possible to detect faults in PV systems by utilizing machine learning and meteorological data.

    14 Future Work

There are a couple of improvements that could be made to this project to make it more effective at achieving the stated goals. The following are some of them, with regard to the delimitations brought up in Section 3.1.

The algorithm for finding the nearest weather station for a given meteorological parameter was described in section 8.1. This method of finding parameters was a way of ensuring that it is always possible to find the parameters needed to train a model for a given PV system. However, it raises an important question: how far away can a weather station be for a given parameter to still be representative of the weather conditions where the PV system is located? It might be the case that the system is retrieving parameters that are too geographically distant to have any explanatory effect on the power output. Further studies could be made on this to create a more theoretically supported algorithm for collecting parameters, which could possibly increase the precision of the random forest regressor model.

Furthermore, since the system is very limited in the information it has about the PV systems it tries to model, as described in section 3.1, one model per PV system is used. If more information were added, such as size and capacity, it would be possible to train one model for all PV systems. This would probably increase the prediction accuracy, provided that the additional information was reasonably explanatory of the power output; in other words, it would need to have a strong enough correlation to the power output.


    A Libraries

Several Python libraries were used throughout this project. A few that stand out are requests, sklearn, pgeocode, and pandas.

Requests is an easy-to-use library for sending HTTP requests. In this project that translates to API calls, all of which are performed with requests.

Sklearn is a machine learning library capable of classification, clustering, regression, and more. It has been used to compare 6 models with 8 different scalers to find a suitable model for predicting power output.

Pgeocode is a library for geocoding, i.e. converting addresses or similar information to coordinates. It supports 83 countries, including Sweden. The library has been used to get approximate coordinates in cases where no coordinates were found for the PV system but the zip code was available.

Pandas is a library for data manipulation and analysis. It has been used to read CSV files into the pandas-specific DataFrame format and then manipulate the data, typically during preprocessing.
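As a small illustration of the pgeocode use case described above, the snippet below resolves a Swedish zip code to approximate coordinates; the zip code is a placeholder.

```python
import pgeocode

# Resolve an approximate (lat, long) from a Swedish zip code when no exact
# coordinates are available for the PV system. The zip code is a placeholder.
nomi = pgeocode.Nominatim("se")
info = nomi.query_postal_code("752 37")
print(info.latitude, info.longitude)
```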

    B Explanatory Variables

    For training the machine learning model the following explanatory variables were used:

    • Air temperature

    • Air humidity

    • Precipitation

    • Cloud coverage

    • Air pressure

    • Solar irradiance

    • Previous output

    • The hour in the date converted with a sine function

    • The hour in the date converted with a cosine function


    • The month in the date converted with a sine function

    • The month in the date converted with a cosine function

All of the meteorological parameters listed above were gathered from the SMHI API, apart from solar irradiance, which was collected using STRÅNG.
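The sine/cosine conversion of the hour and month listed above is a standard way of encoding cyclic features, so that for example hour 23 and hour 0 end up close to each other. A minimal sketch, with assumed column names and placeholder timestamps, is shown below.

```python
import numpy as np
import pandas as pd

# Hypothetical timestamps; in the project they come from the preprocessed data set.
df = pd.DataFrame({"timestamp": pd.date_range("2019-10-01", periods=4, freq="H")})

hour = df["timestamp"].dt.hour
month = df["timestamp"].dt.month

# Encode each cyclic quantity as a point on the unit circle.
df["hour_sin"] = np.sin(2 * np.pi * hour / 24)
df["hour_cos"] = np.cos(2 * np.pi * hour / 24)
df["month_sin"] = np.sin(2 * np.pi * (month - 1) / 12)
df["month_cos"] = np.cos(2 * np.pi * (month - 1) / 12)
print(df)
```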

    C Models and Scalers

    For the comparison of different regression models the following were used:

    • Random Forest Regression

    • Linear Regression

    • Decision Tree Regression

    • Neural Network (MLP)

    • LASSO

    • K-Nearest Neighbour Regression

    The scalers that were used in combination with the above models:

    • PowerTransformer-Yeo-Johnson

    • RobustScaler

    • Standardization

    • Normalization

    • MinMaxScaler

    • MaxAbsScaler

    • QuantileTransformer-Normal

    • QuantileTransformer-Uniformed

All models and scalers are imported from the Python library sklearn.
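A compact version of such a comparison can be written with sklearn pipelines and cross-validation, as sketched below. The random feature matrix, the scoring metric, and the reduced set of models and scalers are illustrative assumptions rather than the project's exact evaluation setup.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.model_selection import cross_val_score

# Random placeholder data standing in for the hourly feature matrix and power output.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 9))
y = rng.normal(size=500)

models = {
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Linear": LinearRegression(),
    "Lasso": Lasso(alpha=0.1),
}
scalers = {
    "Standard": StandardScaler(),
    "MinMax": MinMaxScaler(),
    "Robust": RobustScaler(),
}

# Score every (scaler, model) combination with 5-fold cross-validated R^2.
for s_name, scaler in scalers.items():
    for m_name, model in models.items():
        pipe = make_pipeline(scaler, model)
        score = cross_val_score(pipe, X, y, cv=5, scoring="r2").mean()
        print(f"{s_name:>8} + {m_name:<12} R^2 = {score:.3f}")
```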


    D Data Set

A picture of what a data set used in creating the expected output looks like. Every row represents one hour of meteorological data together with the columns added in the preprocessing steps.


    Figure 7 Example of what a dataset for a PV system can look like


    E Decrease Tables

Tables for the 40% and 60% decreases are presented below. There are four tables for each decrease, each with a different time horizon: 21, 14, 7, or 3 days.

Threshold %   True Positive   False Positive   False Negative   Average days until error found
20            37              5                0                21.000
25            39              3                0                21.000
30            41              1                0                21.000
35            42              0                0                21.281
40            37              0                5                34.548
45            20              0                22               62.471
50            10              0                32               64.150
55            3               0                39               72.569
60            0               0                42               No Errors Found

    Table 7 40% decrease with a time horizon of 21 days.

Threshold %   True Positive   False Positive   False Negative   Average days until error found
35            40              2                0                14.328
40            37              1                4                27.704
45            26              0                16               55.585
50            17              0                25               62.554
55            7               0                35               64.256
60            2               0                40               75.146

    Table 8 40% decrease with a time horizon of 14 days.

Threshold %   True Positive   False Positive   False Negative   Average days until error found
35            36              6                0                7.405
40            36              5                1                11.664
45            32              1                9                48.510
50            25              0                17               59.322
55            14              0                28               59.393
60            7               0                35               54.798

Table 9 40% decrease with a time horizon of 7 days.


Threshold %   True Positive   False Positive   False Negative   Average days until error found
35            15              27               0                3.431
40            15              27               0                4.714
45            16              22               4                21.112
50            17              14               11               50.100
55            19              9                14               56.818
60            22              4                16               54.591
65            24              2                16               63.566
70            14              0                28               57.066
75            5               0                37               56.008

    Table 10 40% decrease with a time horizon of 3 days.

Threshold %   True Positive   False Positive   False Negative   Average days until error found
20            37              5                0                21.000
25            39              3                0                21.000
30            41              1                0                21.000
35            42              0                0                21.000
40            42              0                0                21.000
45            42              0                0                21.000
50            42              0                0                21.000
55            42              0                0                21.256
60            37              0                5                36.440

    Table 11 60% decrease with a time horizon of 21 days.

Threshold %   True Positive   False Positive   False Negative   Average days until error found
35            40              2                0                14.000
40            41              1                0                14.000
45            42              0                0                14.000
50            42              0                0                14.000
55            42              0                0                14.281
60            38              0                4                27.509

    Table 12 60% decrease with a time horizon of 14 days.


Threshold %   True Positive   False Positive   False Negative   Average days until error found
35            36              6                0                7.000
40            37              5                0                7.000
45            41              1                0                7.001
50            42              0                0                7.025
55            42              0                0                7.272
60            41              0                1                11.246

Table 13 60% decrease with a time horizon of 7 days.

Threshold %   True Positive   False Positive   False Negative   Average days until error found
35            15              27               0                3.000
40            15              27               0                3.000
45            20              22               0                3.000
50            28              14               0                3.161
55            33              9                0                3.186
60            38              4                0                4.332
65            34              2                6                33.659
70            27              0                15               49.892
75            26              0                16               59.502

    Table 14 60% decrease with a time horizon of 3 days.
