
Data Analysis by Checking, Clustering and Componentizing in IMPL (IMPL-DataAnalysis)

Industrial Algorithms LLC (IAL)

www.industrialgorithms.com September 2014

Introduction

Presented in this short document is a description of our three separate techniques to analyze data by checking, clustering and componentizing it before it is used by IMPL's other routines, especially in on-line/real-time decision-making applications. We also have other data consistency or analysis techniques, described in other IMPL documents, which relate to the application of data reconciliation and regression with diagnostics; these require an explicit model (model-based), whereas the techniques below do not, i.e., they are data-based techniques.

Data Checking

IMPL's two separate data checking routines are well-known in the process industries, especially where PLC and DCS applications are implemented. The first data checking routine has two parts: (1) check the “range” (or domain) of the data against its expected lower and upper absolute bounds, similar to bounds-checking in compilers, and (2) check the “rate-of-change” (ROC) of the data against its lower and upper relative bounds, i.e., the lower bound represents the minimum expected ROC from data sample to sample and the upper bound represents the maximum expected ROC.

The second data checking routine is more sophisticated and is more relevant to continuous processes. This is the technique of steady-state detection (SSD); a new and accurate method can be found in Kelly and Hedengren (2013). It requires several key process or operating variables such as flows, holdups, temperatures, pressures, analyzers, etc. to be checked to see if they are statistically stationary or steady, typically over a one-hour time-horizon. If a majority of the key variables are at steady-state then we can declare the process as a whole to be at steady-state, and steady-state empirical and/or engineering models can then be used to monitor and optimize the process. An illustrative sketch of both data checking routines is given below.
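What follows is a minimal sketch, not IMPL's implementation, of the two data checking routines described above: an absolute range check, a sample-to-sample rate-of-change check and a simple majority-vote steady-state flag. The function names, bounds, window length and the variance-ratio statistic used as the stationarity test are illustrative assumptions only; the actual SSD method of Kelly and Hedengren (2013) is a different and more rigorous test.

import numpy as np

def check_range(x, lo, hi):
    # Range (domain) check: True where the sample lies within its expected
    # lower and upper absolute bounds [lo, hi].
    x = np.asarray(x, dtype=float)
    return (x >= lo) & (x <= hi)

def check_rate_of_change(x, roc_lo, roc_hi):
    # Rate-of-change (ROC) check: True where the absolute sample-to-sample
    # change lies within its expected lower and upper relative bounds.
    # The first sample has no predecessor and passes by convention.
    x = np.asarray(x, dtype=float)
    roc = np.abs(np.diff(x))
    ok = (roc >= roc_lo) & (roc <= roc_hi)
    return np.concatenate(([True], ok))

def majority_steady_state(X, window=60, ratio_limit=2.0):
    # Declare the process at steady-state when a majority of the key process
    # variables look stationary over the chosen horizon (e.g., 60 one-minute
    # samples for a one-hour window).  Each variable is called steady here when
    # its variance about the mean is small relative to half the variance of its
    # sample-to-sample differences -- an illustrative stand-in for a proper SSD
    # statistic, not the Kelly and Hedengren (2013) method.
    X = np.asarray(X, dtype=float)[-window:, :]
    steady = 0
    for j in range(X.shape[1]):
        x = X[:, j]
        var_mean = np.var(x) + 1e-12
        var_diff = np.var(np.diff(x)) / 2.0 + 1e-12
        if var_mean / var_diff <= ratio_limit:
            steady += 1
    return steady > X.shape[1] / 2.0

In practice the absolute and ROC bounds would come from instrument ranges and historical operating data, and the steady-state window and threshold would be tuned per process.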

Data Clustering

IMPL's data clustering routine implements the Fuzzy C-Means Clustering (FCMC) algorithm (Bezdek et al., 1984; Bezdek et al., 1999), a nonlinear iterative algorithm which usually requires multiple randomized re-starts to find the most accurate c-means or k-means clusters. Once the data has passed the above data checks, it is possible to cluster the operating or process data (also using a set of key process variables) into several regions, groups or partitions which usually correspond to various and distinct operating modes or process operations such as minimum throughput, maximum yield, high conversion, low grade, etc. The number of clusters can be estimated using the gap statistic of Tibshirani et al. (2001), or is usually known a priori based on production/operating orders and logs, where these modes/operations are planned/scheduled weeks to days in advance. A compact sketch of the FCMC iteration is given below.
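As an illustration only, here is a compact numpy sketch of fuzzy c-means with randomized re-starts in the spirit of Bezdek et al. (1984); it is not IMPL's routine. The fuzzifier m, tolerance, iteration limits and seed are assumed defaults, and in practice the data would be scaled first and the number of clusters c chosen from the gap statistic or the known operating modes.

import numpy as np

def fuzzy_c_means(X, c, m=2.0, tol=1e-6, max_iter=300, restarts=10, seed=0):
    # X is an (n_samples, n_features) array of checked/scaled process data and
    # c is the number of clusters (operating modes/regions).  Returns the best
    # (centers, memberships) found over all randomized re-starts.
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    best_V, best_U, best_J = None, None, np.inf

    for _ in range(restarts):
        # random initial membership matrix U (n x c) with rows summing to 1
        U = rng.random((n, c))
        U /= U.sum(axis=1, keepdims=True)

        for _ in range(max_iter):
            Um = U ** m
            # cluster centers (targets/centroids): membership-weighted means
            V = (Um.T @ X) / Um.sum(axis=0)[:, None]
            # squared distances of every sample to every center
            D = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2) + 1e-12
            # membership update: u_ic proportional to D_ic ** (-1 / (m - 1))
            U_new = 1.0 / (D ** (1.0 / (m - 1)))
            U_new /= U_new.sum(axis=1, keepdims=True)
            if np.max(np.abs(U_new - U)) < tol:
                U = U_new
                break
            U = U_new

        J = float(((U ** m) * D).sum())   # fuzzy clustering objective
        if J < best_J:
            best_V, best_U, best_J = V, U, J

    return best_V, best_U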

The FCMC algorithm is useful for assigning real-number probabilities (weights or memberships) to the expectation that the process is in a particular operating/production mode, region or regime. This information is useful because multiple local, and perhaps linear or simpler nonlinear, models can then be employed to monitor and optimize the process accurately, similar to the approach of Aumi and Mhaskar (2011) and Aumi et al. (2011) for controlling a nonlinear batch process using multiple linear auto-regressive exogenous (ARX) dynamic models. Their approach uses the clustering routine to determine which ARX model to use for the control prediction and manipulation, given the current state of the process, where transitions from one cluster, mode or region to the next result in probabilities lying between 0.0 and 1.0.

When the clusters have been determined from the training, calibration or development data in terms of their cluster targets or mean-centers/centroids, the same routine can be used with testing, control or deployment data by fixing the targets and computing the weights or membership probabilities in only one (1) iteration. These weights can then be used to weight or proportion the predictions from the multiple localized models, as sketched below. With regard to dimensionality, if the number of key process variables used in the clustering is large then it is possible and practical to use the data componentizing routine described below to cluster only the larger principal components (Aumi and Mhaskar, 2011).
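Given fixed cluster targets or centroids, e.g., those returned by the fuzzy_c_means sketch above, the deployment step just described can be pictured as follows. This is a minimal illustration, not IMPL's routine: the memberships of a new sample are computed in a single pass with the centers held fixed and are then used to proportion the predictions of per-cluster local models, which are left here as generic, purely hypothetical callables.

import numpy as np

def memberships_fixed_centers(x_new, V, m=2.0):
    # One-iteration membership computation for a new sample with the cluster
    # centers V held fixed (no center update).
    x_new = np.asarray(x_new, dtype=float)
    D = ((x_new[None, :] - V) ** 2).sum(axis=1) + 1e-12
    u = 1.0 / (D ** (1.0 / (m - 1)))
    return u / u.sum()                 # weights/probabilities sum to 1.0

def blended_prediction(x_new, V, local_models, m=2.0):
    # Weight or proportion the predictions of the localized models by the
    # fuzzy membership probabilities of the current sample.
    u = memberships_fixed_centers(x_new, V, m)
    preds = np.array([model(x_new) for model in local_models])
    return float(u @ preds)

For example, with two clusters and two simple stand-in local models, blended_prediction(x, V, [model_low, model_high]) returns a prediction that moves smoothly between the two models as the process transitions from one mode, region or regime to the other.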
Data Componentizing

IMPL's data componentizing routine implements the very well-known Principal Components Analysis and Regression (PCA/PCR), where the X-block of explanatory or regressor variables is componentized into one or more factors, latent variables or principal components called scores, which are orthogonal, perpendicular or completely independent of each other. These scores are then used to regress one or more Y-block responses. The loadings that relate the scores to the X-block are computed in the PCA prior to computing the regression parameters relating the scores to the Y-block, which it can be argued is inferior to the related technique of Partial Least Squares or Projection to Latent Structures (PLS). This inferiority of PCR compared to PLS is primarily attributed to the fact that PCR requires more components than PLS for the same or similar R2 or percent (%) Y explained, and it is therefore not as parsimonious as PLS. A small sketch of PCA/PCR is given below.
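The sketch below shows the textbook version of this two-step procedure using numpy, not IMPL's routine: mean-center the blocks, take the leading principal components of the X-block via the singular value decomposition, and then regress the Y-block on the resulting orthogonal scores. The function names and the choice of n_components are assumptions for illustration.

import numpy as np

def pcr_fit(X, Y, n_components):
    # Principal Components Regression: PCA on the X-block, then least squares
    # of the Y-block on the scores.  X is (n_samples, n_x), Y is (n_samples, n_y).
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean

    # PCA via the SVD: columns of P are the loadings, T = Xc @ P are the scores
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components, :].T
    T = Xc @ P

    # regression parameters relating the (orthogonal) scores to the Y-block
    B, *_ = np.linalg.lstsq(T, Yc, rcond=None)
    return x_mean, y_mean, P, B

def pcr_predict(X_new, x_mean, y_mean, P, B):
    # Apply the fitted loadings and score-space regression to new X-block data.
    T_new = (np.asarray(X_new, dtype=float) - x_mean) @ P
    return T_new @ B + y_mean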

To address this issue, a unique and unpublished technique found only in IMPL is our Principal Component Regression Optimization (PCRO). This algorithm computes the scores and regression coefficients simultaneously by minimizing a weighted sum of squares of residuals for both the X- and Y-blocks together, with regularization similar to the Levenberg-Marquardt (trust-region) algorithm in nonlinear parameter estimation. PCR and PLS compute the latent variables or scores sequentially, one at a time, typically using NIPALS, whereas PCRO, as mentioned, computes the loadings and regression parameters together in the same nonlinear optimization problem, solved using an Equality-Constrained Successive Quadratic Programming (SQP) algorithm. The interesting feature of PCRO is that, for the same or similar R2 or % Y explained, it requires fewer components than PLS.
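Purely as an illustrative reading of the description above (the PCRO formulation itself is unpublished), the simultaneous fit can be pictured as a single regularized least-squares problem over the loadings P and the score-space regression coefficients B, with scores T = X P:

    minimize over P, B:   w_X * ||X - X P P'||^2 + w_Y * ||Y - X P B||^2 + lambda * R(P, B),   subject to P'P = I

where w_X and w_Y weight the X- and Y-block residuals, R is a regularization term in the spirit of Levenberg-Marquardt, and the equality constraints are handled by the SQP solver. All symbols here are assumptions for illustration only.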

References

Bezdek, J.C., Ehrlich, R., Full, W., "FCM: the fuzzy c-means clustering algorithm", Computers & Geosciences, 10, 191, (1984).

Bezdek, J.C., Keller, J., Krishnapuram, R., Pal, N.R., "Fuzzy models and algorithms for pattern recognition and image processing", Kluwer Academic Publishers, TA1650.F89, (1999).

Tibshirani, R., Walther, G., Hastie, T., "Estimating the number of clusters in a data set via the gap statistic", J.R. Statist. Soc. B, 63, 411-423, (2001).

Aumi, S., Mhaskar, P., "Integrating data-based modeling and nonlinear control tools for batch process control", American Control Conference, San Francisco, June, (2011).

Aumi, S., Corbett, C., Mhaskar, P., "Data-based modeling and control of Nylon 6,6 batch polymerization", American Control Conference, San Francisco, June, (2011).

Kelly, J.D., Hedengren, J.D., "A steady-state detection (SSD) algorithm to detect non-stationary drifts in processes", Journal of Process Control, 23, 326, (2013).