Analytic for Data-Driven Decision-Making in
Complex High-Dimensional Time-to-Event Data
A Dissertation Presented
by
Keivan Sadeghzadeh
to
The Department of Mechanical and Industrial Engineering
in partial fulfillment of the requirements
for the degree of
Doctor of Philosophy
in
Industrial Engineering
Northeastern University
Boston, Massachusetts
August 2015
Acknowledgments
I wish to thank everyone who made this dissertation possible.
I would like to extend my deepest appreciation to my research advisor, Professor Nasser
Fard, for his technical and editorial advice which was essential for bringing this research to
its current form. I am especially grateful for his excellent direction and the enthusiasm he
expressed for my work.
This research greatly benefited from the useful suggestions and comments from the mem-
bers of my exceptional doctoral committee, Professors Vinod Sahney and Tucker Marion.
I wish to thank them for all their contributions.
I want to express my sincere gratitude to my beloved wife, Niloofar Montazeri, for provid-
ing me with motivation, understanding, support and encouragement at every step of this
pursuit.
Many thanks go to Kian as well, the best child a father could have, for giving me the op-
portunity to get this work done. I am extremely grateful for having him and already proud
of him.
Abstract
In the era of big data, the analysis of complex and massive data expends time
and money and may cause errors and misinterpretations. Consequently, inaccurate
and erroneous reasoning could lead to poor inference and decision-making,
and sometimes to irreversible and catastrophic events. On the other hand,
proper management and utilization of valuable data could significantly increase
knowledge and reduce cost through preventive actions. In many areas,
there is great interest in the time and causes of events. Time-to-event data
analysis is a kernel of risk assessment and has an inevitable role in predicting
the probability of many events' occurrence. In addition, variable selection
and classification procedures are an integral part of data analysis; the
information revolution brings larger datasets with more variables, and it has
become more difficult to process streaming high-dimensional time-to-event
data with traditional approaches, specifically in the presence of censored
observations. Thus, in the presence of large-scale, massive and complex data,
specifically in terms of variables, applying proper methods to efficiently
simplify such data is desired. Most of the traditional variable selection
methods involve computational algorithms in the class of non-deterministic
polynomial-time hard (NP-hard) problems, which makes these procedures
infeasible. Although recent methods may operate faster and involve different
estimation methods and assumptions, their applications are limited, their
assumptions cause restrictions, their computational complexities are costly,
or their robustness is not consistent.
This research is motivated by the importance of applied variable reduction
in complex high-dimensional time-to-event data, which avoids the aforementioned
difficulties in decision-making and facilitates time-to-event data analysis.
Quantitative statistical and computational methodologies using combinatorial
heuristic algorithms for variable selection and classification are
proposed. The purpose of these methodologies is to reduce the volume of the
explanatory variables and identify a set of the most influential variables in
such datasets.
In Chapter 1, an introduction to this research, the problem definition and an
outline of the approach are provided through a literature review.
In Chapter 2, a comprehensive review of time-to-event data analysis is presented,
dealing with censoring, survival and hazard functions, and the categories of
this type of data. This review is completed with an applied definition of the
decision-making process. Next, prominent data analysis tools and techniques
are presented, including discretization and randomization processes and data
reduction methods. Finally, the accelerated failure time model is introduced
and defined.
Chapter 3 provides Methodology I: Nonparametric Re-sampling Methods for the
Kaplan-Meier Estimator Test. In this methodology, an analytical model for the
logical transformation of the explanatory variable dataset is proposed, and
a class of hybrid nonparametric variable selection methods and algorithms
through variable inefficiency recognition is designed. The experimental
results and their comparison with well-known methods and simulation patterns
are presented next.
Chapter 4 describes Methodology II: Heuristic Randomized Decision-Making
Methods through the Weighted Accelerated Failure Time Model. The methodology,
including an analytical model for the normal transformation of the explanatory
variable dataset as well as heuristic randomized methods and algorithms, is
proposed. Finally, the simulation experiment, results and analysis of the
variable selection decision-making algorithms are presented.
In Chapter 5, Methodology III: Hybrid Clustering and Classification Methods
using Normalized Weighted K-Mean is proposed. In this methodology, two
different methods to cluster and classify variables are designed. Results
and analysis of variable clustering and classification are included as well.
In Chapter 6, concluding remarks and a summary of the proposed methods and
their advantages are presented. In addition, an overview of further studies
is discussed.
The computer package used in this research is the MATLAB® R2011b programming
environment.
Contents
List of Figures 6
List of Tables 8
1 Introduction 10
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.2 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3 Problem Definition and Solution Approach . . . . . . . . . . . 15
1.4 Big Data Analytics . . . . . . . . . . . . . . . . . . . . . . . . 17
1.5 Specialty and Application . . . . . . . . . . . . . . . . . . . . 20
1.5.1 Volume . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.5.2 Variety . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.5.3 Velocity . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.5.4 Veracity . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.5.5 Variability . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.5.6 Visualization . . . . . . . . . . . . . . . . . . . . . . . 25
1.5.7 Value . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2 Preliminaries, Concepts and Definitions 27
2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.2 Time-to-Event Data Analysis . . . . . . . . . . . . . . . . . . 28
2.2.1 Censoring . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.2.2 Survival and Hazard Function . . . . . . . . . . . . . . 30
2.2.3 Categories of Time-to-Event Data . . . . . . . . . . . . 33
2.2.4 Kaplan-Meier Estimator . . . . . . . . . . . . . . . . . 34
2.3 Accelerated Failure Time Model . . . . . . . . . . . . . . . . . 35
2.4 Decision-Making Process . . . . . . . . . . . . . . . . . . . . . 38
2.5 Data Analysis Tools and Techniques . . . . . . . . . . . . . . 40
2.5.1 Discretization Process . . . . . . . . . . . . . . . . . . 40
2.5.2 Randomization Process . . . . . . . . . . . . . . . . . . 42
2.5.3 Data Reduction Methods . . . . . . . . . . . . . . . . . 43
2.5.4 Feature Extraction and Feature Selection . . . . . . . . 45
2.5.5 Principal Component Analysis . . . . . . . . . . . . . . 46
3 Methodology I: Nonparametric Re-sampling Methods for Kaplan-
Meier Estimator Test 51
3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.2 Analytical Model I . . . . . . . . . . . . . . . . . . . . . . . . 52
3.2.1 Model Validation . . . . . . . . . . . . . . . . . . . . . 55
3.3 Methods and Algorithms for Variable Selection . . . . . . . . . 59
3.3.1 Singular Variable Effect . . . . . . . . . . . . . . . . 61
3.3.2 Nonparametric Test Score . . . . . . . . . . . . . . . . 62
3.3.3 Splitting Semi-Greedy Clustering . . . . . . . . . . . . 65
3.3.4 Weighted Time Score . . . . . . . . . . . . . . . . . . . 67
3.4 Experimental Results and Analysis for Right-Censored Data . 68
3.5 Experimental Results and Analysis for Uncensored Data . . . 75
4 Methodology II: Heuristic Randomized Decision-Making Meth-
ods through Weighted Accelerated Failure Time Model 80
4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.2 Analytical Model II . . . . . . . . . . . . . . . . . . . . . . . . 81
4.3 Methods and Algorithms for Variable Classification . . . . . . 84
4.3.1 Ranking Classification . . . . . . . . . . . . . . . . . 87
4.3.2 Randomized Resampling Regression . . . . . . . . . . . 89
4.4 Simulation Experiment, Results and Analysis for RC Method . 91
4.4.1 Simulation Design and Experiment . . . . . . . . . . . 91
4.4.2 Result and Analysis . . . . . . . . . . . . . . . . . . . . 93
4.5 Simulation Experiment, Results and Analysis for 3R Method . 95
4.5.1 Simulation Design and Experiment . . . . . . . . . . . 95
4.5.2 Result and Analysis . . . . . . . . . . . . . . . . . . . . 96
5 Methodology III: Hybrid Clustering and Classification Meth-
ods using Normalized Weighted K-Mean 100
5.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.2 Methods and Algorithms for Variable Clustering . . . . . . . . 101
5.2.1 Clustering through Cost Function . . . . . . . . . . . . 102
5.2.2 Clustering through Weight Function . . . . . . . . . . 107
5.3 Experiment and Result . . . . . . . . . . . . . . . . . . . . . . 110
6 Conclusion 112
6.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
6.2 Next Steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
Bibliography 115
Appendices 122
A Matlab Code . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
List of Figures
1.1 Big Data Analytics Options Plotted by Potential Growth and
Commitment [45]. . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.2 Seven dimensions of large-scale data. . . . . . . . . . . . . . . 21
1.3 Volume reduction flow diagram (n′ < n and p′ < p). . . . . . 22
1.4 Binarization and Normalization through transformation models. 23
1.5 Sample visualization. . . . . . . . . . . . . . . . . . . . . . . . 26
2.1 Probability density, failure, survival, and hazard functions. . . 33
2.2 Kaplan-Meier estimate. . . . . . . . . . . . . . . . . . . . . . . 35
2.3 Data mining categories. . . . . . . . . . . . . . . . . . . . . . . 44
2.4 Schema of PCA transformation. . . . . . . . . . . . . . . . . . 48
3.1 Comparison of covariate correlations in the original and the
transformed dataset. . . . . . . . . . . . . . . . . . . . . . . . 56
3.2 Empirical survival function (solid blue), Cox PHM baseline
survival function for the original dataset (black dashed), and
Cox PHM baseline survival function for transformed logical
dataset (red dotted). . . . . . . . . . . . . . . . . . . . . . . . 58
3.3 Scree plot of the original PBC dataset including 17 variables
and 276 observations. . . . . . . . . . . . . . . . . . . . . . . . 70
3.4 Empirical transformed logical uncensored PBC dataset survival
function (solid red) with 99% confidence bounds (red dotted).
Subset of k variables with a failed-to-reject Wilcoxon rank sum
test result (gray), and subset of k variables with a rejected
Wilcoxon rank sum test result (black). . . . . . . . . . . . . . 72
3.5 Hybrid scatter plot - upper-right variables have less efficiency. 76
3.6 Radar plot of inefficient variables: normalized inefficiency
results from the transformed logical uncensored PBC dataset
by the SVE algorithm (red), SSG algorithm (green), and WTS
algorithm (yellow). . . . . . . . . . . . . . . . . . . . . . . . 78
4.1 Normalized coefficient score results for the simulation dataset
for 100 trials. . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.2 Normalized coefficient score results in 100 trials for the simu-
lation dataset in the Normal model. . . . . . . . . . . . . . . . 97
4.3 Normalized coefficient score results in 100 trials for the simula-
tion dataset in the Uniform model (top), Normal model (middle)
and Exponential model (bottom). . . . . . . . . . . . . . . . . 98
5.1 Random clustering process applied to matrix V. . . . . . . . . 104
5.2 Sample transformation. . . . . . . . . . . . . . . . . . . . . . . 105
5.3 Sample of time-based clustering. . . . . . . . . . . . . . . . . . 106
5.4 Schema of time-based binning. . . . . . . . . . . . . . . . . . . 108
List of Tables
3.1 Schema for high-dimensional time-to-event dataset. . . . . . . 53
3.2 Modified k-means clustering algorithm result for the transformed
logical uncensored PBC dataset. Number of clusters
is equal to ‖pk‖. . . . . . . . . . . . . . . . . . . . . . . . . 71
3.3 Selected less efficient variables in all proposed methods and
comparison to RSF, ADD, and LS method results. . . . . . . . 73
3.4 Selected less efficient variables in all proposed methods and
comparison to the simulation-defined pattern. . . . . . . . . . 73
3.5 Performance of the proposed methods based on six numerical
simulation experiments. m is the integer average number of
selected inefficient variables and p is the performance of the
method based on 100 replications. . . . . . . . . . . . . . . . . 75
3.6 Selected inefficient variables in all proposed methods and com-
parison to NTS, RSF, ADD, and LS method results. . . . . . 77
3.7 Selected inefficient variables in all proposed methods and com-
parison to NTS results and the simulation-defined pattern. . . 79
4.1 Performance of the proposed methods based on four numerical
simulation experiments with 200 replications. . . . . . . . . . . 94
4.2 Performance of the proposed methods based on three numer-
ical simulation experiments with 200 replications. . . . . . . . 99
5.1 High-dimensional time-to-event dataset . . . . . . . . . . . . . 102
5.2 Sample of time-based binning based on three techniques. . . . 109
5.3 Clustering by cost function . . . . . . . . . . . . . . . . . . . . 111
Chapter 1
Introduction
1.1 Overview
Advancement in technology has led to the accessibility of massive and complex
data in many fields. Proper management and utilization of valuable
data could significantly increase knowledge and reduce cost through preventive
actions, whereas erroneous and misinterpreted data could lead to poor inference
and decision-making. As an analytical approach, decision-making is
the process of finding the best option among all feasible alternatives. The
application of the decision-making process in economics, management, psychology,
mathematics, statistics and engineering is evident, and this process is an
important part of many science-based professions.
In many areas, there is great interest in the time and causes of events. Birth in
demography and death in medical sciences, hospitalization in sociology and
arrest in criminology, promotion in management and bankruptcy in business,
revolution in political science and divorce in psychology, eclipse in astronomy
and failure in engineering science, metamorphosis in biology and earthquake
in geology: all are examples of an event which can occur at a specific time
point and change some quantitative variable. The information from such
observations, including event time, censoring indicator, and explanatory
variables collected over a period of time, is called time-to-event data.
This type of data is the outcome of many scientific investigations, experiments
and surveys. In the field of data and decision analysis, it has become
more difficult to process streaming high-dimensional time-to-event data
with traditional approaches, specifically in the presence of censored
observations.
Variable selection is a necessary step in a decision-making process dealing
with large-scale data. There is always uncertainty when researchers aim to
collect the most important variables, specifically in the presence of big data.
Variable selection for decision-making in many fields is mostly guided by expert
opinion [10]. The computational complexity of evaluating all possible combinations
of the p variables, from size 1 to p, can be overwhelming: the total
number of combinations is 2^p − 1. For example, for a dataset of 20 explana-
tory variables, the number of all possible combinations is 2^20 − 1 = 1,048,575.
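This count is easy to verify; the short Python sketch below (illustrative only, and not part of this dissertation's MATLAB code; the function name is ours) computes the number of non-empty variable subsets:

```python
# Each of the p variables is either in or out of a subset (2^p choices);
# excluding the empty subset leaves 2^p - 1 candidate subsets.
def n_subsets(p: int) -> int:
    return 2 ** p - 1

print(n_subsets(20))   # -> 1048575
print(n_subsets(30))   # already over a billion candidate subsets
```

The exponential growth is the point: adding ten more variables multiplies the exhaustive search space by roughly a thousand.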
This research presents a class of multipurpose practical methods to analyze
complex high-dimensional time-to-event data, to reduce redundant information
and facilitate practical interpretation through variable efficiency and
inefficiency recognition. In addition, numerical experiments and simulations
are developed to investigate the performance and validation of the proposed
methods.
1.2 Literature Review
Analytic data-driven decision-making can substantially improve the management
decision-making process. This process is increasingly based on the
type and size of data, as well as on analytic methods. It has been suggested
that new methods to collect, use and interpret data should be developed to
increase the performance of decision makers [7, 35]. Data collection and
analysis play crucial roles in data-driven decision-making. In many cases
data are collected from different sources, such as financial reports and
marketing data, and combined for more informative decision-making. Consequently,
determining effective explanatory variables, specifically in complex
and high-dimensional data, provides an excellent opportunity to increase
efficiency and reduce costs.
Time-to-event data such as failure or survival times have been extensively
studied in engineering, economics, business and medical science. In addition,
with the advent of modern data-collecting technologies, a huge amount of
this type of data includes high-dimensional covariates. This massive amount
of data is increasingly accessible from various sources such as transaction-
based information, information-sensing devices, remote sensing technologies,
machine and logistics statistics, and wireless sensor networks, as well as
quality analytics in engineering, manufacturing, service operations and many
other segments. Unlike traditional datasets with few explanatory variables,
the analysis of datasets with a high number of variables requires different
approaches. In this situation, variable selection techniques can be used to
determine a subset of variables that is significantly more valuable for
analyzing high-dimensional time-to-event datasets. If data are compiled and
processed correctly, they can enable informed decision-making [6, 13, 21, 37, 41, 54, 60].
The opportunity for manufacturing and services in the era of data is to
analyze their performance to enhance quality. Quality dimensions of
products and services, whether defined by quality experts [16, 58] or perceived
by customers [4], are summarized as performance, availability, reliability,
maintainability, durability, serviceability, conformance, warranty, and price,
as well as aesthetics and reputation. Access to valuable data for sophisticated
analytics can substantially improve the management decision-making process. In
the field of reliability, analyzing the data collected from different sources
such as customer reports and testing laboratories, and consequently
determining the features and variables that are effective on failure time,
specifically in complex, high-dimensional and censored time-to-event data,
provides an excellent chance for manufacturers to reduce costs, improve
efficiency and ultimately improve the quality of their products by detecting
failure causes faster [37, 43].
Many management decision-making processes involve complicated problems,
including multiple criteria and multiple variables, as well as risk and
uncertainty, that require mathematical and statistical techniques, data
analysis tools, combinatorial algorithms, and quantitative computational
methods. As an analytical approach, decision-making is the process of finding
the best option among all feasible alternatives [10, 51], and advanced
analytics are among the most popular techniques used in massive data analysis
and the decision-making process [45]. In many professional areas and
activities, decision-making is increasingly based on the type and size of
data, rather than on experience and intuition [7, 35]. In a decision-making
process, variable selection and variable reduction are necessary procedures
in dealing with large-scale data. There is always uncertainty when one aims
to collect the most important variables, specifically in the presence of big data.
1.3 Problem Definition and Solution Approach
Variable selection procedures are an integral part of data analysis. The
information revolution brings larger datasets with more variables, and the
demand for variable selection as a fundamental strategy for data analysis is
increasing. Although a wide variety of variable selection methods have been
proposed, there is still plenty of work to be done. Many of the recommended
procedures have been given only a narrow theoretical motivation, and their
operational properties need more systematic investigation before they can be
used with confidence. The problem of variable selection and subset selection
in large-scale datasets arises when it is desired to model the relationship
between a response variable and a subset of explanatory variables, but there
is uncertainty about which subset to use.
Most of the traditional variable selection methods, such as the Akaike
information criterion (AIC) [3] or the Bayesian information criterion (BIC)
[53], involve computational algorithms in the class of non-deterministic
polynomial-time hard (NP-hard) problems, which makes these procedures
infeasible. Although recent methods may operate faster and involve different
estimation methods and assumptions, such as the Cox proportional hazards
model [11], accelerated failure time [31], the Buckley-James estimator [8],
random survival forests [27], additive risk models [36], weighted least
squares [24] and classification and regression trees (CART) [5], their
applications are limited, their assumptions cause restrictions, their
computational complexities are costly, or their robustness is not consistent.
Since exact results for NP-hard problems cannot be obtained robustly,
heuristic approaches are the only viable options for a variety of complex
optimization problems, such as variable selection in complex high-dimensional
time-to-event data, which need to be solved in real-world applications. The
criteria for deciding whether to use a heuristic approach for solving a given
problem include optimality, completeness, accuracy, and execution time, which
should be evaluated by experts. In mathematical optimization, a heuristic
is a technique designed to find a quick solution to a problem when classical
methods are slow, or to find an approximate solution when classical
methods fail to find any robust solution. As a shortcut, this is achieved by
trading optimality, completeness, accuracy, or precision for speed. The
objective of a heuristic is to produce, in a reasonable time frame, a solution
that is good enough for the problem at hand. This solution may not be the best
of all the actual solutions to the problem, or it may simply approximate the
exact solution, but it does not require a prohibitively long time.
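The speed-for-optimality trade-off can be made concrete with a toy Python sketch contrasting exhaustive subset search with a greedy forward-selection heuristic. The additive scoring function and random weights below are hypothetical stand-ins for illustration only, not the methods developed in this dissertation:

```python
import itertools
import random

random.seed(0)

p = 12
WEIGHTS = [random.uniform(-1, 1) for _ in range(p)]

def score(subset):
    # Hypothetical model-quality score for a variable subset:
    # reward useful variables, penalize subset size.
    return sum(WEIGHTS[v] for v in subset) - 0.5 * len(subset)

# Exhaustive search: O(2^p) evaluations, infeasible for large p.
best_exact = max(
    (frozenset(s) for r in range(1, p + 1)
     for s in itertools.combinations(range(p), r)),
    key=score)

# Greedy forward selection: O(p^2) evaluations; trades a guarantee
# of optimality for a dramatic reduction in running time.
selected, remaining = set(), set(range(p))
while remaining:
    v = max(remaining, key=lambda v: score(selected | {v}))
    if score(selected | {v}) <= score(selected):
        break  # no remaining variable improves the score
    selected.add(v)
    remaining.remove(v)

print(score(best_exact), score(frozenset(selected)))
```

For this additive toy score the greedy heuristic happens to match the exhaustive optimum; in general it only approximates it, which is exactly the trade described above.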
This study is motivated by the importance of the above-mentioned variable
selection issue. The objective of this study is to design and propose
combinatorial procedures and methodologies, including nonparametric and
heuristic methods, for variable selection, classification and reduction in
complex high-dimensional time-to-event data through determining variable
efficiency and inefficiency. The purpose of the proposed analytic models and
heuristic methods is to reduce the volume of the explanatory variables and
identify a set of the variables most influential on the time-to-event.
The methodologies proposed in this dissertation are (1) Nonparametric
Re-sampling Methods for the Kaplan-Meier Estimator Test, including a class of
hybrid nonparametric variable selection methods, (2) Heuristic Randomized
Decision-Making Methods through the Weighted Accelerated Failure Time Model
for variable classification and reduction, and (3) Hybrid Clustering and
Classification Methods using Normalized Weighted K-Mean, which is designed to
cluster and classify a high number of variables. The achievements are to
reduce limitations in applications, relax restrictions caused by assumptions,
decrease the cost of computational procedures, and increase accuracy and
robustness.
1.4 Big Data Analytics
There are more than one trillion connected devices in the world, and over 15
petabytes (15 × 10^15 bytes) of new data are generated each day [25]. As of
2012, the overall data created in a single day was approximated at 2.5 exabytes
(2.5 × 10^18 bytes) [26, 39].
The size and dimension of datasets expand because data are increasingly
gathered by information-sensing mobile devices, remote sensing technologies,
genome sequencing, cameras, microphones, RFIDs, wireless sensor
networks, and internet search, as well as finance logs [13, 21, 54]. The world's
technological per-capita capacity to store information has approximately
doubled every 40 months [22, 39], while the annual worldwide information
volume is growing at a minimum rate of 59 percent [15].
In 2012, total software, social media, and IT services spending related to
big data and analytics was reported at over $28 billion worldwide by the IT
research firm Gartner, Inc. This amount was forecast to drive $34 billion in
2013. It has also been predicted that IT organizations will spend $232 billion
on hardware, software and services related to big data through 2016.
In business, economics and other fields, decisions will increasingly be based
on data and analysis rather than on experience and intuition. In recent
years, researchers have suggested that businesses should discover new ways to
collect and use data every day and develop the ability to interpret data to
increase the performance of their decision makers, which is where data-driven
decision-making (DDDM) methods emerge. According to MIT research in 2011,
among the 330 large publicly-traded companies in the study, those adopting
this method achieved productivity gains 5 to 6 percent higher than others [7, 35].
In a broad survey published in 2011, based on responses from organizations
about the techniques and tool types used in big data analysis, advanced
analytics (e.g., mining, predictive) and data visualization were among the
most popular in terms of potential growth and commitment [45]; see Fig. 1.1.
Accordingly, the strongest commitment among options for big data analytics is
to advanced analytics, where closely related options such as predictive
analytics, data mining, and statistical analysis have a similar commitment. On
the other hand, the strongest potential growth among options for big data
analytics is projected for advanced data visualization (ADV).
1.5 Specialty and Application
The application of the methodologies designed in this research is focused
on reliability and failure in manufacturing and business, presenting applied
heuristic procedures for variable subset selection and classification in high-
dimensional, large-scale and complex time-to-event datasets. In this
dissertation, a class of applied data reduction and appropriate variable
selection and classification methods for such datasets is proposed, which
makes it possible to avoid difficulties in decision-making and facilitates
time-to-event and failure analysis. To demonstrate the significance and
advantages of the methodologies proposed in this study over previous ones,
seven dimensions of large-scale data are considered, as shown in Fig. 1.2.
Figure 1.2: Seven dimensions of large-scale data.
1.5.1 Volume
A vast amount of data is generated every year, which makes most datasets too
large to store and analyze using traditional technologies. High-dimensional
time-to-event data is a special dataset, including event time and explanatory
variables, which needs innovative methods of analysis when the number
of variables exceeds that of traditional problems. In comparison to other
methods, the designed algorithms decrease the size of the dataset in terms of
explanatory variables and observations in the first stage of analysis, using
the clustering concept as a preliminary volume-reduction process, as shown in
Fig. 1.3.
Figure 1.3: Volume reduction flow diagram (n′ < n and p′ < p).
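A minimal pure-Python sketch of this volume-reduction idea, collapsing an n-by-p dataset to n′ cluster representatives with plain k-means. The toy data and parameters here are our own illustrative choices; the dissertation's actual clustering algorithms are developed in Chapters 3 and 5:

```python
import random

random.seed(1)

def kmeans(points, k, iters=20):
    """Plain k-means: returns k centroids that summarize the points."""
    centroids = random.sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for x in points:
            j = min(range(k),
                    key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(x, centroids[j])))
            clusters[j].append(x)
        # Move each centroid to the mean of its cluster.
        for j, cl in enumerate(clusters):
            if cl:
                centroids[j] = tuple(sum(c) / len(cl) for c in zip(*cl))
    return centroids

# Toy n-by-p dataset (n = 200 observations, p = 3 variables),
# drawn from two well-separated groups.
data = [tuple(random.gauss(c, 0.3) for _ in range(3))
        for c in (0.0, 5.0) for _ in range(100)]

# Reduce n = 200 rows to n' = 2 representative rows.
reduced = kmeans(data, k=2)
print(len(data), "->", len(reduced))
```

Downstream analysis then operates on the n′ representatives instead of all n observations, which is the volume reduction the flow diagram depicts.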
1.5.2 Variety
Years ago, all data created was structured data fitted into columns
and rows. Data today comes in many different complex formats, and most
of the data generated is unstructured. The wide variety of data requires a
different approach as well as different techniques for analysis, where each
type of data requires a different kind of analysis or a different tool. The
models proposed in this study regulate and integrate complex formats of
explanatory variables in an unstructured time-to-event dataset, including
different types of quantitative and qualitative data mixed with censored and
missing values, and transform this set into a simple and easy-to-analyze
dataset using binarization and normalization, as shown in Fig. 1.4.
The advantages of the transformation models in this study are:
1. Transformed data is understood and interpreted easily.
2. It enables experts to define explanatory variables in a simple format
beforehand.

Figure 1.4: Binarization and Normalization through transformation models.
3. In comparison to other relevant techniques, transformed data is processed
and analyzed faster and more cheaply in terms of calculation.
4. The accuracy of mathematical calculations in the data mining algorithms
of subsequent stages is higher, owing to covariance reduction.
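The two transformations of Fig. 1.4 can be sketched as follows. This is a toy Python illustration: the threshold and the min-max scaling are our own illustrative choices, while the dissertation's actual logical and normal transformation models are defined in Chapters 3 and 4:

```python
def binarize(column, threshold):
    """Logical transformation: 1 if the value exceeds a threshold, else 0."""
    return [1 if v > threshold else 0 for v in column]

def normalize(column):
    """Normal transformation: min-max scaling of a column into [0, 1]."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

# A hypothetical explanatory variable (e.g., age of a subject).
ages = [34, 51, 47, 62, 29]
print(binarize(ages, threshold=45))   # -> [0, 1, 1, 1, 0]
print(normalize(ages))                # values scaled into [0, 1]
```

Either output is a simple, uniform column that downstream selection and clustering algorithms can process cheaply, which is the point of the transformation stage.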
1.5.3 Velocity
Velocity, or speed, refers to how fast the data is generated, stored,
analyzed and visualized. Data processors require time to process the data and
update the databases. In the era of big data, real-time data is commonly
generated, which is a challenge. The proposed methods and algorithms
analyze and process high-dimensional time-to-event datasets faster than
traditional methods to obtain variable selection results.
1.5.4 Veracity
Incorrect data can cause many problems. Data users need to be assured of
the accuracy of the data and the performance of the data analyses. One
of the biggest problems with big data is the tendency for errors: user entry
errors, redundancy and corruption all affect the value of data. Accountability
and trust play a major role in data science, specifically in big data problems.
Veracity is defined in three domains: the source, nature, and process of data.
In other words, veracity is the quality and understandability of the data; it
is not just about data quality but also about data understandability. An
advantage of the proposed methods is understandability, where these methods
can potentially become tools with a user-friendly end-user interface.
1.5.5 Variability
There are four commonly used measures of variability: range, interquartile
range, variance and standard deviation. In order to perform a proper
variability analysis, algorithms need to be able to understand the context and
decipher the exact meaning of the data. The transformation models designed for
the early stage of data analysis in this study play this role perfectly and
make the big time-to-event dataset accessible for any variability analysis,
which has not been done before in this fashion.
1.5.6 Visualization
Visualization refers to making the vast amount of data comprehensible in a
manner that is easy to understand and interpret. Raw data can be put to
use with the right visualizations. The best tools for visualization are
complex graphs that can include many variables of data while still remaining
understandable. The designed output of each part of every methodology in
this study is an understandable hybrid graph; telling the story of a complex
data analysis in a graph is very difficult but also extremely crucial. As a
sample, Fig. 1.5 depicts the visualization process.
1.5.7 Value
According to McKinsey, the potential annual value of large-scale data to US
health care is $300 billion. All available data can create a lot of value, but
data in itself is not valuable at all. The value is in the analyses, in how the
data is turned into information and eventually into knowledge; it relies on
insights derived from data analyses for decision-making. The proposed
analytics for data-driven decision-making in complex high-dimensional
time-to-event data enable users to extract valuable information from such
data through a class of novel methodologies.

Figure 1.5: Sample visualization. A transformed dataset is evaluated against
criteria a, b and c by Algorithms A, B and C, whose outputs feed a hybrid
visualization tool.
Chapter 2
Preliminaries, Concepts and
Definitions
2.1 Overview
An introduction to time-to-event data analysis and the decision-making
process, as well as a review of prominent data analysis tools and techniques
and the accelerated failure time model, are presented in this chapter.
2.2 Time-to-Event Data Analysis
Time-to-event data analysis methods consider the time until the occurrence
of an event. This time can be measured in days, weeks, years, etc. The
analysis is widely used in engineering, where it is sometimes called life data
or failure time analysis, since the main focus is on modeling the time it takes
for a component to fail. In the social sciences, such as economics, event
history analysis is the alternative term; in the medical sciences it is known
as survival analysis. In time-to-event data, subjects are usually followed
over a specified time period, and the study focuses on predicting the
probability of survival or failure. Examples of time-to-event data are the
lifetimes of mechanical devices, electronic components, or complex systems,
as well as insurance compensation claims, workers' promotions, and company
bankruptcies [31, 33, 59].
2.2.1 Censoring
Generally, data censoring occurs when some information is available for a
variable but is not complete; analyzing a censored variable requires specific
procedures. In time-to-event data analysis there are usually some
observations that do not experience the event during the study, so their
time-to-event is incomplete. It is known that if the event happens to these
cases, the time of the event will be greater than the length of the study or
observation period. Censoring also occurs when there is no information
about excess demand for a limited product or service that is discontinued,
or due to a lack of refinement in measurements made with experimental or
industrial equipment.
One approach to dealing with censored data is to set the censored
observations to missing, or to replace the unobserved value with the mean,
minimum, maximum, or a randomly assigned value from the range. This is
acceptable only when the number of censored observations is negligible;
otherwise these solutions can cause serious bias in the estimates and discard
potentially important information. There are two categories of right
censoring:

1- Singly censored data, which includes (a) time termination, where the
study or observation is limited to a fixed period of time due to restrictions
such as time and cost, and (b) failure termination, where the study or
observation continues until a fixed portion of subjects has failed. In time
termination, the survival time recorded for a failed subject is the time from
the start of the experiment to its failure, called an exact or uncensored
observation. The survival time of any remaining subject is not known
exactly but is recorded as at least the length of the study period and is
called a censored observation. In this type I censoring, if there are no
accidental losses, all censored observations equal the length of the study
period. In failure termination, each censored observation equals the largest
uncensored observation.

2- Progressively censored data, also called random censoring, in which the
period of study or observation is fixed and subjects enter the study at
different times during that period.
Aspects such as censoring and non-linearity create difficulty in analyzing the
data with traditional statistical models; regression models cannot perform
effectively in the presence of censored observations [32, 33]. Right censoring
is the form most commonly seen in time-to-event data. When an experiment
is censored from the right, some subjects do not undergo the full duration of
the experiment or study. This can happen for many reasons, such as
termination of the study or incidents. Analyzing right-censored
high-dimensional time-to-event data to find an appropriate model or
distribution is time-consuming and uneconomical, and most likely no
theoretical distribution can be adequately extracted. In this situation,
nonparametric methods are appropriate.
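The construction above, where the observed time is the minimum of the event time and an independent censoring time, can be sketched in a small simulation. The exponential event and censoring rates below are illustrative assumptions only, not part of any dataset in this study.

```python
import random

def simulate_right_censored(n, event_rate=0.1, censor_rate=0.05, seed=0):
    """Simulate right-censored data: T = min(Y, C), delta = 1 if Y <= C.

    Exponential event and censoring times are an illustrative assumption.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        y = rng.expovariate(event_rate)   # true time-to-event
        c = rng.expovariate(censor_rate)  # independent censoring time
        t = min(y, c)
        delta = 1 if y <= c else 0        # 1 = event observed, 0 = censored
        data.append((t, delta))
    return data

sample = simulate_right_censored(5)
```

Subjects with `delta = 0` contribute only the information that their event time exceeds the recorded time, which is exactly why the specialized procedures discussed above are needed.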
2.2.2 Survival and Hazard Function
By definition, the probability density of an event occurring at time $t$ is
$$f(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t)}{\Delta t} \qquad (2.1)$$
In time-to-event or survival analysis, information on event status and
follow-up time is used to estimate a survival function $S(t)$, which is defined
as the probability that an object survives at least until time $t$:
$$S(t) = P(\text{an object survives longer than } t) = P(T > t) \qquad (2.2)$$
From the definition of the cumulative distribution function (or failure
function):
$$S(t) = 1 - P(T \le t) = 1 - F(t) \qquad (2.3)$$
Accordingly, the survival function is obtained from the probability density
function as:
$$S(t) = \int_t^{\infty} f(u)\,du \qquad (2.4)$$
The survival function has the following properties: $S(0) = 1$, $S(t) \to 0$
as $t \to \infty$, and $S(u) \le S(t)$ for $u \ge t$. In most applications, the
survival function is shown as a step function rather than a smooth curve.
The hazard function is defined as the probability that a subject who has
survived to time $t$ will succumb to the event in the next instant:
$$h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t} \qquad (2.5)$$
All four functions, the probability density $f(t)$, the cumulative distribution
(failure) function $F(t)$, the survivorship or survival function $S(t)$, and the
hazard function $h(t)$, are mathematically related:
1- Hazard function from the probability density and survival functions:
$$h(t) = \frac{f(t)}{S(t)} \qquad (2.6)$$
2- Probability density function from the survival function:
$$f(t) = -\frac{dS(t)}{dt} \qquad (2.7)$$
3- Probability density function from the hazard function:
$$f(t) = h(t)\, e^{-\int_0^t h(u)\,du} \qquad (2.8)$$
4- Survival function from the hazard function:
$$S(t) = e^{-\int_0^t h(u)\,du} \qquad (2.9)$$
5- Hazard function from the survival function:
$$h(t) = -\frac{d}{dt}\log S(t) \qquad (2.10)$$
Fig. 2.1 illustrates the four aforementioned functions.
Figure 2.1: Probability density, failure, survival, and hazard functions.
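The relations in Equations (2.6)-(2.10) can be checked numerically for a constant-hazard (exponential) lifetime; the rate $\lambda = 0.5$ below is an illustrative assumption.

```python
import math

lam = 0.5  # constant hazard of an exponential lifetime (illustrative)

def S(t):  # survival function: S(t) = exp(-lam * t), Eq. (2.9)
    return math.exp(-lam * t)

def f(t):  # density: f(t) = lam * exp(-lam * t)
    return lam * math.exp(-lam * t)

def h(t):  # hazard from Eq. (2.6): h(t) = f(t) / S(t)
    return f(t) / S(t)

t = 2.0
# constant hazard: h(t) recovers lam exactly
assert abs(h(t) - lam) < 1e-12
# Eq. (2.7): f(t) = -dS/dt, checked by a central finite difference
eps = 1e-6
dS = (S(t + eps) - S(t - eps)) / (2 * eps)
assert abs(f(t) + dS) < 1e-6
```

Each relation holds term by term: the hazard is constant, and the density equals the negative slope of the survival curve.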
2.2.3 Categories of Time-to-Event Data
Methods to analyze time-to-event data are parametric, which are based on
survival function distributions such as the exponential; semi-parametric,
which do not assume knowledge of absolute risk and estimate relative risk;
and nonparametric, which are useful when the underlying distribution of the
problem is unknown. For moderate- to high-dimensional covariates, it is
difficult to apply parametric or semi-parametric methods [24]. Nonparametric
methods are used to describe survivorship in a population or to compare two
or more populations.
33
2.2.4 Kaplan-Meier Estimator
The Kaplan-Meier (K-M) estimator is the most commonly used
nonparametric method for the survival function and has clear advantages,
since it does not require the approximation that results from dividing
follow-up time into intervals [23, 33]. The nonparametric estimate of $S(t)$
according to this estimator, for distinct ordered event times $t_1, \ldots, t_n$, is:
$$\hat{S}(t) = \prod_{i:\, t_i \le t} \left(1 - \frac{d_i}{n_i}\right) \qquad (2.11)$$
where at each event time $t_i$ there are $n_i$ subjects at risk and $d_i$ is the
number of subjects which experienced the event. Let $c_i$ denote the number
of subjects censored between $t_i$ and $t_{i+1}$. Then the likelihood function
takes the form:
$$L = \prod_{i=1}^{n} \left[S(t_{i-1}) - S(t_i)\right]^{d_i} \left[S(t_i)\right]^{c_i} \qquad (2.12)$$
For the conditional probability of surviving, if we define
$\pi_i = S(t_i)/S(t_{i-1})$, then the maximum-likelihood estimate of $\pi_i$ is:
$$\hat{\pi}_i = 1 - \frac{d_i}{n_i} \qquad (2.13)$$
Graphically, the Kaplan-Meier estimate is a step function with
discontinuities, or jumps, at the observed failure times, as shown in Fig. 2.2.
It has been shown [44] that the K-M estimator is consistent, and its
completely nonparametric nature assures little or no loss of efficiency in
practice.
Figure 2.2: Kaplan-Meier estimate.
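The product-limit estimate of Equation (2.11) can be sketched directly; the six-subject toy dataset below is hypothetical.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of S(t) (Eq. 2.11).

    times  : observed times t_i
    events : 1 if the event occurred at t_i, 0 if censored
    Returns a list of (t, S(t)) pairs at each distinct uncensored time.
    """
    pairs = sorted(zip(times, events))
    n_at_risk = len(pairs)
    s, curve, i = 1.0, [], 0
    while i < len(pairs):
        t = pairs[i][0]
        d = sum(1 for tt, e in pairs if tt == t and e == 1)  # events at t
        m = sum(1 for tt, e in pairs if tt == t)             # leave risk set
        if d > 0:
            s *= 1.0 - d / n_at_risk
            curve.append((t, s))
        n_at_risk -= m
        i += m
    return curve

# toy data: six subjects; events[i] = 0 marks a censored observation
times  = [1, 2, 3, 4, 4, 5]
events = [1, 1, 0, 1, 1, 0]
curve = kaplan_meier(times, events)
```

The censored subject at time 3 leaves the risk set without producing a jump, so the estimate only steps down at times 1, 2 and 4, matching the step-function shape in Fig. 2.2.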
The accelerated failure time model is presented next.
2.3 Accelerated Failure Time Model
In time-to-event data analysis, when the distribution of the event time as a
response variable is unknown, especially in the presence of censored
observations, estimation and inference become challenging and traditional
statistical methods are not practical. The Cox proportional hazards model
(PHM) [11] and the accelerated failure time (AFT) model [31] are frequently
used in these circumstances; for a comprehensive comparison see [40].
Similar to a multiple linear regression model, the AFT model assumes a
direct relation between the logarithm of the time-to-event and the
explanatory variables.
As defined in Section 2.2, the event time may not always be observable; this
is known as censoring. In this situation the response variable is defined as:
$$T_i = \min(Y_i, C_i), \qquad \delta_i = \begin{cases} 1, & Y_i \le C_i \\ 0, & Y_i > C_i \end{cases} \qquad i = 1 \ldots n \qquad (2.14)$$
The censoring time $C$ is independent of the event time $Y$. Each
time-independent explanatory variable $X_j$, $j = 1 \ldots p$, corresponding
to the $i$th observation acts multiplicatively on the time-to-event $Y_i$, or
additively on $\ln T_i$, where the relation between the explanatory variables
and the response variable is considered linear. The AFT model is generally
written as:
$$S(t \mid x) = S_0\!\left(\frac{t}{\psi(x)}\right) \qquad (2.15)$$
where $S_0(\cdot)$ is the baseline survival function and $\psi(x)$ is an
acceleration factor defined as follows:
$$\psi(x) = e^{\beta X} \qquad (2.16)$$
The coefficient vector $\beta$ is of length $p$. The corresponding log-linear
form of the AFT model with respect to time is given by:
$$\ln(T) = X\beta + \varepsilon \qquad (2.17)$$
Nonparametric maximum likelihood estimation does not work properly in
the AFT model [62], so in many studies the least squares method has been
used to estimate the parameters. To account for censoring, the weighted
least squares method is commonly used, as ordinary least squares does not
work for censored data in the AFT model. According to Stute [55], to obtain
estimators using weighted least squares, a general hypothesis is that the
vector $\varepsilon$ consists of random disturbances with zero mean, so that
$E[\varepsilon \mid X] = 0$. In many studies the error distribution is left
unspecified [29], and the assumption of homoscedasticity, a constant variance
for the errors, is adopted. In addition, regardless of the values of the
coefficients and explanatory variables, the log form assures that the
predicted values of $T$ are positive.
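A minimal sketch of fitting the log-linear form of Equation (2.17) by ordinary least squares on fully observed (uncensored) synthetic data follows. The Stute-type censoring weights discussed above are deliberately omitted, and the coefficients and data are hypothetical.

```python
import random

random.seed(1)
beta_true = [0.5, -1.0]  # hypothetical coefficient vector (p = 2)
n = 200
X, log_t = [], []
for _ in range(n):
    x = [random.gauss(0, 1), random.gauss(0, 1)]
    eps = random.gauss(0, 0.1)  # zero-mean disturbance, E[eps | X] = 0
    X.append(x)
    # ln(T) = X beta + eps, Eq. (2.17)
    log_t.append(x[0] * beta_true[0] + x[1] * beta_true[1] + eps)

# solve the 2x2 normal equations (X'X) beta = X'y by Cramer's rule
sxx = [[sum(a[i] * a[j] for a in X) for j in range(2)] for i in range(2)]
sxy = [sum(a[i] * y for a, y in zip(X, log_t)) for i in range(2)]
det = sxx[0][0] * sxx[1][1] - sxx[0][1] * sxx[1][0]
beta_hat = [
    (sxy[0] * sxx[1][1] - sxy[1] * sxx[0][1]) / det,
    (sxy[1] * sxx[0][0] - sxy[0] * sxx[1][0]) / det,
]
```

Because the model is fit on the log scale, the predicted times `exp(X @ beta_hat)` are positive by construction, as noted in the text.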
Next, a review of the decision-making process is presented.
2.4 Decision-Making Process
Decision-making is the process of making choices by setting objectives,
gathering information, and assessing alternative choices. This process
includes seven steps:
1. Identifying a problem or an opportunity
2. Gathering relevant information
3. Analyzing the problem and information
4. Establishing several possible options and alternative solutions
5. Evaluating the alternatives for feasibility, acceptability and desirability
6. Selecting a preferred alternative, considering possible future adverse consequences
7. Implementing and evaluating the solution
Decision-making theories are classified based on two attributes:
(a) deterministic, which deals with a logical preference relation for any given
action, or probabilistic, which postulates a probability function instead; and
(b) static, which assumes the preference relation or probability function is
time-independent, or dynamic, which assumes time-dependent events [9].
Historically, deterministic-static decision-making has been the more popular
decision-making process, specifically under uncertainty. The decision-making
assumptions in this study fall into this category as well.
A major part of decision-making involves the analysis of a finite set of
alternatives described in terms of evaluative criteria. The mathematical
techniques of decision-making are among the most valuable parts of this
process and are generally referred to as realizations of the quantitative
methods of decision-making [51]. With the increasing complexity and
variety of decision-making problems due to the huge size of data, the process
of decision-making becomes more valuable. In most decision-making
problems, a multiplicity of criteria for evaluating the alternatives is
pervasive [7, 51]. The application of the decision-making process in
economics, psychology, management, mathematics, statistics and engineering
is evident, and this process is an important part of all science-based
professions. In this study, a decision-making problem is assumed to be
(1) deterministic, dealing with a logical preference relation for any given
action, and (2) static, considering the preference relation or probability
function as time-independent.

The performance of the decision-making process may be affected by a set of
different factors, including data size, the number of variables, and the
presence of special cases such as censored or missing data.

A review of the data analysis tools and techniques commonly used in this
study follows.
2.5 Data Analysis Tools and Techniques
The discretization process, randomization process, and data reduction
methods are reviewed briefly.
2.5.1 Discretization Process
Variables in a dataset are potentially a combination of different data types,
such as:

Dichotomous (binary); occurring in one of two possible states, often
labelled zero and one, e.g., improved: yes/no, or failed: yes/no.

Nominal; values that can be assigned a code in the form of a numeric label,
which is countable but cannot be ordered or measured, e.g., male and
female (coded as 0 and 1).

Ordinal; values that can be ranked or rated, which can be counted or
ordered but not measured, e.g., anxiety scale: none, mild, moderate, and
severe, with numerical values of 0, 1, 2, 3.

Categorical; values indicating membership in one of several possible
non-overlapping categories, e.g., color: black, brown, red, etc. If the
categories are assigned numerical values used as labels, this is a synonym
for nominal.

Discrete; values restricted to integers, e.g., the number of a specific
component in a warehouse.

Continuous (interval); values not restricted to particular levels, which may
take any value within a finite or infinite interval, e.g., days of hospital stay,
human reaction time, or temperature.
There are many advantages to using discrete values over continuous ones:
discrete variables are easy to understand and utilize, more compact and
more accurate. Quantizing continuous variables is called the discretization
process. In splitting discretization methods, continuous ranges are divided
into sub-ranges by a user-specified width, considering the range of values or
the frequency of observation values in each interval, respectively called
equal-width and equal-frequency discretization. A typical algorithm for the
splitting discretization process, which quantizes one continuous feature at a
time, generally consists of four steps: (1) sort the feature values, (2)
evaluate an appropriate cut-point, (3) split the range of continuous values
according to the cut-point, and (4) stop when a stopping criterion is
satisfied.

In this study, discretization of the explanatory variables of a time-to-event
dataset is assumed to be unsupervised, static, global and direct, in order to
achieve a top-down splitting approach and a transformation of all types of
variables in the dataset into a logical (binary) format. Briefly, static
discretization is carried out independently of the classification task, global
discretization uses the entire observation space to discretize, and direct
methods divide the range into k intervals simultaneously. For a
comprehensive study of the discretization process, see [34].
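The four-step splitting procedure above can be sketched for the equal-width case; the feature values and the choice of k = 2 are illustrative.

```python
def equal_width_cutpoints(values, k):
    """Steps (1)-(2): cut-points splitting a continuous feature's range
    into k equal-width sub-ranges."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + width * i for i in range(1, k)]

def discretize(values, cuts):
    """Step (3): map each value to the index of its sub-range."""
    return [sum(1 for c in cuts if v >= c) for v in values]

x = [2.0, 4.5, 7.0, 9.5, 12.0]
cuts = equal_width_cutpoints(sorted(x), 2)  # one cut at the mid-range, 7.0
labels = discretize(x, cuts)                # -> [0, 0, 1, 1, 1]
```

With k = 2 this is exactly the binary (logical) format targeted in this study: one cut-point at the mid-range of each variable.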
2.5.2 Randomization Process
Randomization is the process of selecting or assigning subjects with equal
chance. Many procedures and methods have been proposed for random
assignment, such as simple, replacement, block, stratified, and covariate
adaptive randomization. To select an appropriate method that produces
interpretable and valid results in a specific study, knowledge of the
advantages and disadvantages of each method is necessary. In this study we
briefly review simple and block randomization [56].

Simple randomization keeps the assignment of a subject to a particular
group completely random, based on a single sequence of random
assignments. This approach is simple and easy to implement, and may be
conducted with or without replacement. In simple replacement
randomization, selection criteria such as a subject stopping rule, which
guarantees that one subject cannot be selected more than a specified
number of times, are beneficial.
Block randomization is designed to randomize subjects into groups of equal
sample size. This method is used to ensure balance in sample size across
groups over time. Blocks are balanced with predetermined group
assignments, which keeps the number of subjects in each group similar, and
the block size is determined by the researcher. Sometimes it is desired to
assign a group (cluster) of subjects together, as in a cluster randomized
trial; the procedures for block randomizing clusters are identical to those
described above.
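The block scheme above can be sketched as follows; the two groups and the block size of four are illustrative choices.

```python
import random

def block_randomize(n_subjects, groups=("A", "B"), block_size=4, seed=0):
    """Assign subjects to groups in balanced blocks.

    Each block contains block_size assignments with every group appearing
    equally often; blocks are shuffled independently, keeping group sizes
    similar over time.
    """
    assert block_size % len(groups) == 0
    rng = random.Random(seed)
    per_group = block_size // len(groups)
    assignment = []
    while len(assignment) < n_subjects:
        block = [g for g in groups for _ in range(per_group)]
        rng.shuffle(block)  # randomize order within the block
        assignment.extend(block)
    return assignment[:n_subjects]

alloc = block_randomize(12)
```

After any complete block, the group counts differ by at most zero, which is the balance-over-time property that distinguishes block from simple randomization.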
2.5.3 Data Reduction Methods
Knowledge Discovery in Databases (KDD) is defined as the process of
acquiring knowledge from raw data through selection, cleaning, processing,
transformation, data mining and evaluation. Data mining, the analysis step
of KDD, is the computational process of discovering patterns in large
datasets and covers the classes of data selection, learning, clustering,
classification, prediction and summarization. The categories of data mining
are shown in Fig. 2.3.
These categories are grouped into two general analysis sets: descriptive and
predictive. Clustering is the process of organizing objects into groups with
similar members according to one or more criteria. This process is a
descriptive analysis stage of data mining which deals with finding structure
in a collection of unlabeled data; a cluster is therefore a collection of data
objects that are similar to one another. Cluster analysis has many
applications and can be used in the data mining process as a preprocessing
step for other data mining algorithms [56]. Unsupervised clustering
techniques have been applied to many important problems.

Figure 2.3: Data mining categories.
Data reduction techniques are categorized into three main strategies:
dimensionality reduction, numerosity reduction, and data compression
[56, 57].

Dimensionality reduction, the most efficient strategy in the field of
large-scale data, deals with reducing the number of random variables or
attributes under the special circumstances of the problem. Dimensionality
reduction methods are mainly wavelet transforms and principal component
analysis (PCA) [1, 30], the latter also known as singular value
decomposition (SVD) and the Karhunen-Loeve transform (KLT) [18], for
transformation and projection of the original data and elimination of a
subset of the original data in terms of variable covariance.

Numerosity reduction includes techniques that substitute a smaller
alternative set for the original data. These methods are either parametric,
which mainly retain only estimated parameters instead of the original data,
such as regression, or nonparametric, which store reduced representations of
the data, such as histograms, sampling and clustering.

The data compression strategy deals with transformations of the data to
reach a compressed representation of the original data. These methods,
depending on the applied algorithm, may cause loss of data during the
process of reconstructing the data.
2.5.4 Feature Extraction and Feature Selection
All dimensionality reduction strategies and techniques can also be classified
as feature extraction or feature selection approaches.

Feature extraction is defined as transforming the original data into a new
lower-dimensional space through some functional mapping, such as PCA
and SVD [2, 42]. Most unsupervised dimensionality reduction techniques
are closely related to PCA, but this technique is not directly applicable to
very large datasets [61].

Feature selection denotes electing a subset of the original data (features),
without a transformation, based on some criteria in order to filter out
irrelevant or redundant features; examples include filter methods, wrapper
methods and embedded methods [19, 52].

Although a broad understanding of all methods is not easy, general
knowledge of these categorizations is recommended in this study.
Utilization of any mathematical model, algorithm or technique requires a
complete knowledge of the purpose of solving the problem. For instance,
selecting a proper class of dimensionality reduction, such as feature
extraction (PCA, SVD, etc.) or feature selection (filter, wrapper and
embedded methods), requires appropriately characterizing the problem.
Similarly, determining whether the data in the problem need a supervised
or unsupervised method is a major part of the analysis process which leads
one to use appropriate techniques.
2.5.5 Principal Component Analysis
Principal Component Analysis (PCA) is one of the oldest and best-known
multivariate analysis techniques. It is widely used to reduce the
dimensionality of data, enabling one to transform data, whether linear or
nonlinear, from a higher to a lower dimension as needed. This reduction, if
performed correctly, eliminates noise and redundant or undesirable data,
retains most of the useful data, saves resources such as time and memory
used in data processing, and generates new significant features, variables
and attributes [1, 30].

Basically, the principal function of PCA is a rotational transformation of
the axes of the data space along the lines of maximum variance, as shown in
Fig. 2.4; the new axes are called principal components and are ordered by
their variance magnitude. Dimensionality reduction is achieved by selecting
the first few principal components while keeping the necessary data.
According to Jolliffe [30], "the central idea of principal component analysis
is to reduce the dimensionality of a dataset in which there are a large
number of interrelated variables, while retaining as much as possible of the
variation present in the dataset. This reduction is achieved by transforming
to a new set of variables, the principal components, which are uncorrelated,
and which are ordered so that the first few retain most of the variation
present in all of the original variables."
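The rotation-to-maximum-variance idea can be sketched for a two-dimensional point cloud, using the closed-form eigenvalues of the 2-by-2 covariance matrix; the synthetic data below are illustrative only.

```python
import math
import random

# synthetic 2-D cloud with most variance along one direction
random.seed(0)
xs = [random.gauss(0, 1) for _ in range(500)]
pts = [(x, 0.5 * x + random.gauss(0, 0.3)) for x in xs]

# sample covariance matrix [[cxx, cxy], [cxy, cyy]]
n = len(pts)
mx = sum(p[0] for p in pts) / n
my = sum(p[1] for p in pts) / n
cxx = sum((p[0] - mx) ** 2 for p in pts) / (n - 1)
cyy = sum((p[1] - my) ** 2 for p in pts) / (n - 1)
cxy = sum((p[0] - mx) * (p[1] - my) for p in pts) / (n - 1)

# eigenvalues in closed form; each is the variance along a principal axis
tr, det = cxx + cyy, cxx * cyy - cxy ** 2
l1 = tr / 2 + math.sqrt(tr ** 2 / 4 - det)  # first principal component
l2 = tr / 2 - math.sqrt(tr ** 2 / 4 - det)  # second principal component
explained = l1 / (l1 + l2)  # variation retained by the first component
```

Keeping only the first component here retains most of the variation, which is exactly the retention criterion discussed next.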
In the literature, principal component analysis (PCA), singular value
decomposition (SVD) and the Karhunen-Loeve transform (KLT) are mostly
used interchangeably. In the field of multivariate statistical analysis, these
three terms are very similar under certain circumstances but not necessarily
identical [19]. Other terms relevant to PCA in different fields are the
Hotelling transformation, independent component analysis (ICA), factor
analysis (FA), empirical orthogonal functions (EOF), proper orthogonal
decomposition (POD), eigenvalue decomposition (EVD), empirical
eigenfunction decomposition (EED), empirical component analysis (ECA),
spectral decomposition, and empirical modal analysis.

Figure 2.4: Schema of PCA transformation.
Some widely suggested methods to determine the appropriate number of
components are [63]:

1) Bartlett's test, a statistical test of whether the remaining eigenvalues are
equal. This method should no longer be employed.
2) The Kaiser criterion (K1), the most commonly used method, suggests
retaining the components whose eigenvalues are greater than 1.0 and
removing components whose eigenvalues are less than the average
eigenvalue. Kaiser's method tends to severely overestimate the number of
components.
3) The scree test is based on a graph of the eigenvalues and is simple to
apply: the eigenvalues are plotted, a straight line is fit through the smaller
values, and those falling above the line are retained. Complications may
occur, such as no obvious break point or more than one break point. The
scree test is effective when strong components are present and has been
found to be most accurate with larger samples and strong components in
non-complex data structures.
4) Minimum Average Partial (MAP), a method based on the application of
PCA and the subsequent analysis of partial correlation matrices. The
method seeks to determine which components are common, and is proposed
as a rule to find the best factor solution rather than the cutoff point for the
number of factors. The MAP method is generally quite accurate and
consistent when the component saturation is high or the component is
defined by more than six variables.
5) Parallel Analysis (PA) involves a comparison of the obtained, real-data
eigenvalues with the eigenvalues of a correlation matrix of the same rank,
based upon the same number of observations but containing only random,
uncorrelated variables. This method is an adaptation of the K1 rule. The
general application of the PA method has been difficult to recommend
because programs needed for its application were not widely available.
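For illustration, parallel analysis can be sketched in the two-variable case, where the eigenvalues of a correlation matrix are exactly 1 ± |r|; the observed correlation of 0.8 used at the end is a hypothetical value.

```python
import random

def sample_corr(xs, ys):
    """Pearson sample correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (sx * sy)

def parallel_threshold(n_obs, n_sim=200, seed=0):
    """Mean largest eigenvalue of 2x2 correlation matrices built from
    purely random, uncorrelated data: for two variables the eigenvalues
    of the correlation matrix are 1 + |r| and 1 - |r|."""
    rng = random.Random(seed)
    tops = []
    for _ in range(n_sim):
        xs = [rng.gauss(0, 1) for _ in range(n_obs)]
        ys = [rng.gauss(0, 1) for _ in range(n_obs)]
        tops.append(1 + abs(sample_corr(xs, ys)))
    return sum(tops) / n_sim

thr = parallel_threshold(100)   # eigenvalue threshold from random data
real_top = 1 + 0.8              # top eigenvalue if the observed |r| = 0.8
retain_first = real_top > thr   # PA retains the first component
```

A strongly correlated pair easily clears the random-data threshold, while for nearly uncorrelated data the top eigenvalue falls below it and no component would be retained.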
Chapter 3
Methodology I: Nonparametric
Re-sampling Methods for
Kaplan-Meier Estimator Test
3.1 Overview
In this chapter, the first proposed analytical model, which transforms the
explanatory-variable dataset into a logical representation as a set of binary
variables, is presented. Next, in order to select the most significant variables
in terms of inefficiency, the designed variable selection methods and
heuristic clustering algorithms are introduced [46-48].
3.2 Analytical Model I
A multipurpose and flexible model is presented for a type of time-to-event
data with a large number of variables, when the correlation between
variables is complicated or unknown. The objective is to simplify the
original covariate dataset into a logical dataset by a transformation lemma.
Next, we show the validation of this designed logical model by correlation
transformation [47, 48]. The analytic model [47] and the methods and
algorithms that follow are potentially applicable to many problems in a
vast area of science and technology.
The random variables $Y$ and $C$ represent the time-to-event and the
censoring time, respectively. Time-to-event data are represented by
$(T, \Delta, U)$, where $T = \min(Y, C)$ and the censoring indicator
$\Delta = 1$ if the event occurred (for instance, a failure is observed) and 0
otherwise. The observed covariate $U$ represents a set of variables. Let any
observation be denoted by $(t_i, \delta_i, u_{ij})$, $i = 1 \ldots n$,
$j = 1 \ldots p$. It is assumed that the hazard at time $t$ depends only on
the survivors at time $t$, which assures the independent right censoring
assumption [31].

The original time-to-event dataset may include any type of explanatory
variable. Many time-independent variables in $U$ are binary or
interchangeable with a binary variable, such as dichotomous variables.
Interpretation of a binary variable is simple, understandable and
comprehensible; in addition, the model is appropriate for fast and low-cost
computation. The general schema of a high-dimensional event history
dataset with $n$ observations and $p$ variables is shown in Table 3.1.
Table 3.1: Schema for a high-dimensional time-to-event dataset.

Obs. #   Time to Event   Var. 1   Var. 2   ...   Var. p
1        t_1             u_11     u_12     ...   u_1p
2        t_2             u_21     u_22     ...   u_2p
...      ...             ...      ...      ...   ...
n        t_n             u_n1     u_n2     ...   u_np
Each array of the $p$ variable vectors will take only two possible values,
canonically 0 and 1. As discussed in Chapter 2, a discretization method is
applied by dividing the range of values of each variable into two equally
sized parts. We define $w_j$ as an initial splitting criterion equal to the
arithmetic mean of the maximum and minimum values of $u_{ij}$:
$$w_j = \frac{\max_i u_{ij} + \min_i u_{ij}}{2}, \qquad j = 1 \ldots p \qquad (3.1)$$
For any array $u_{ij}$ in the $n$-by-$p$ dataset matrix $U = [u_{ij}]$, we
then assign a substituting array $v_{ij}$ as:
$$v_{ij} = \begin{cases} 0, & u_{ij} < w_j \\ 1, & u_{ij} \ge w_j \end{cases} \qquad (3.2)$$
The criterion $w_j$ could also be defined by an expert using experimental or
historical data. The proposed model treats any array with a value of 1 as
desired by the expert and 0 otherwise; in other words, $v_{ij} = 0$
represents the lack of the $j$th variable in the $i$th observation. The result
of the transformation is an $n$-by-$p$ dataset matrix $V = [v_{ij}]$, which
will be used in the following methods and algorithms. We also define the
time-to-event vector $T = [t_i]$, including all observed event times, and
$S = [s_i]$ as the survival function. The logical model could also be
satisfied from the outset by a proper design of the data collection process
based on Boolean logic to generate binary attributes.
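The transformation of Equations (3.1)-(3.2) can be sketched as follows; the 3-by-2 matrix is a hypothetical toy example.

```python
def to_logical(U):
    """Transform an n-by-p covariate matrix U into a binary matrix V
    using the mid-range splitting criterion of Eqs. (3.1)-(3.2):
    w_j = (max_i u_ij + min_i u_ij) / 2 and v_ij = 1 iff u_ij >= w_j."""
    p = len(U[0])
    w = [(max(row[j] for row in U) + min(row[j] for row in U)) / 2
         for j in range(p)]
    return [[1 if row[j] >= w[j] else 0 for j in range(p)] for row in U]

U = [[1.0, 10.0],
     [4.0, 40.0],
     [3.0, 20.0]]
V = to_logical(U)   # splitting criteria: w = [2.5, 25.0]
```

Each column of `V` records whether the corresponding covariate lies in the upper half of its range, which is the "desired / not desired" reading given to the binary arrays above.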
3.2.1 Model Validation
Correlation Transformation
To validate the robustness of this model, we first show that the change in
correlation between variables before and after the transformation is not
significant, i.e., that the logical dataset follows the same pattern and
behavior as the original in terms of covariate correlation. Based on the
Pearson product-moment correlation coefficient, we define matrices
$N = [n_{ij}]$ and $M = [m_{ij}]$, $i = 1 \ldots p$, $j = 1 \ldots p$, whose
entries are the covariances of variables $i$ and $j$ for the original and the
transformed dataset, respectively:
$$n_{ij} = \frac{1}{n-1} \sum_{k=1}^{n} (u_{ik} - \bar{u}_i)(u_{jk} - \bar{u}_j), \qquad i = 1 \ldots p,\; j = 1 \ldots p \qquad (3.3)$$
$$m_{ij} = \frac{1}{n-1} \sum_{k=1}^{n} (v_{ik} - \bar{v}_i)(v_{jk} - \bar{v}_j), \qquad i = 1 \ldots p,\; j = 1 \ldots p \qquad (3.4)$$
where $u_{ik}$ and $v_{ik}$ represent the value of variable $i$ in observation
$k$, and $\bar{u}_i$ and $\bar{v}_i$ the mean of variable $i$, in each dataset
respectively; the second parentheses in Equations 3.3 and 3.4 are defined
similarly for variable $j$.
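Equations (3.3)-(3.4) and the fitted-line comparison that follows can be sketched on a hypothetical 4-by-3 dataset; the values, and the restriction to off-diagonal entries, are illustrative choices.

```python
def cov_matrix(D):
    """Sample covariance matrix of an n-by-p dataset (cf. Eqs. 3.3-3.4)."""
    n, p = len(D), len(D[0])
    means = [sum(row[j] for row in D) / n for j in range(p)]
    return [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in D)
             / (n - 1) for j in range(p)] for i in range(p)]

def fitted_line(xs, ys):
    """Least-squares intercept a and slope b of y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

# toy 4-by-3 dataset and its mid-range binary transform (Eqs. 3.1-3.2)
U = [[1.0, 10.0, 5.0], [4.0, 40.0, 2.0], [3.0, 20.0, 8.0], [2.0, 35.0, 6.0]]
W = [(max(r[j] for r in U) + min(r[j] for r in U)) / 2 for j in range(3)]
V = [[1 if U[i][j] >= W[j] else 0 for j in range(3)] for i in range(4)]

# pair the off-diagonal covariances of the original (N) and transformed
# (M) datasets and fit y = a + b*x
N, M = cov_matrix(U), cov_matrix(V)
xs = [N[i][j] for i in range(3) for j in range(i + 1, 3)]
ys = [M[i][j] for i in range(3) for j in range(i + 1, 3)]
a, b = fitted_line(xs, ys)
```

A near-zero intercept and a positive slope in the scatter of transformed versus original covariances is the pattern the validation looks for.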
The experimental fitted line for the scatter plot of $m_{ij}$ versus $n_{ij}$
for any dataset is $y = a + bx$, where $b$ is a small positive slope and $a$
is not significant. For instance, for the uncensored primary biliary cirrhosis
(PBC) dataset (Section 3.2), Fig. 3.1 shows an experimental result with the
fitted line $y = 0.0116 + 0.6356x$.

Figure 3.1: Comparison of covariate correlations in the original and the
transformed dataset.

The proposed logical model validation and verification of its robustness
were presented comprehensively in [46, 48].
Cox Proportional Hazards Model (PHM) Baseline
The next validation compares the estimates of the survival and hazard
functions of the original and the transformed dataset to verify the proposed
model. Theoretically, a statistical test such as the Wilcoxon rank sum test
or the log-rank test determines whether two samples can be assumed equal,
in other words, collected from identical populations. Based on the Cox
PHM baseline, if the survival or hazard functions of the original and the
transformed dataset are similar, the two datasets are equivalent in terms of
hazard rate and the transformation does not change the effect of the
complete set of variables on the survival function.
Although proportional hazards regression, like other multiple regression models, faces several computational errors when dealing with high-dimensional data, especially with a huge number of covariates, computational experiments show that the variation in the approximation of PHM baselines for equivalent time-to-event datasets is negligible. Comparing the empirical product-limit estimate of the survival function with the PHM baseline survival functions of the original dataset and the transformed logical dataset, for the many sample datasets tested in this study, shows that the transformation of the dataset does not change the nature of the covariates in terms of the survival function and the hazard function.
Fig. 3.2 depicts the empirical survival function and the Cox PHM baseline survival functions for the original dataset and for the transformed logical uncensored PBC dataset (Section 3.3). The Wilcoxon test score comparing the nonparametric baseline survival functions of the original dataset and the transformed logical dataset is 0.9704; according to the test concept, the two functions cannot be distinguished. This result shows that no significant change in the covariate behavior should be expected due to the proposed transformation. We applied similar validation to many datasets in this research to verify this analytic model.
Figure 3.2: Empirical survival function (solid blue), Cox PHM baseline survival function for the original dataset (black dashed), and Cox PHM baseline survival function for the transformed logical dataset (red dotted). Horizontal axis: time t; vertical axis: survival probability S(t).
In order to select the most significant variables in terms of inefficiency, the designed methods and heuristic algorithms are presented next.
3.3 Methods and Algorithms for Variable Selection
We design a class of methods, applied to the proposed logical model, to select inefficient variables in high-dimensional event history datasets. The major assumption in designing appropriate methods for this purpose is that a variable which is completely inefficient on its own can provide a significant performance improvement when engaged with others, and that two variables which are inefficient by themselves can be efficient together [19]. Based on this assumption, we design three methods and heuristic algorithms to select inefficient variables in event history datasets with high-dimensional covariates.
We use the Kaplan-Meier estimator in this study to estimate survival probabilities as a function of time. A nonparametric test can be used to test the null hypothesis that two samples are drawn from the same distribution, against a given alternative hypothesis. Among the many nonparametric tests for comparing survival functions for the aforementioned purpose, we use the log-rank test in our methods as the best fit for comparing two nonparametric distributions. This test is the most commonly used for a typical study under different models of the relationship between the groups [24, 32].
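The two estimators named above can be sketched from first principles; this is a minimal textbook implementation (function names are ours), not the code used in the study:

```python
import numpy as np
from scipy.stats import chi2

def km_estimate(times, events):
    # Kaplan-Meier estimate of S(t) at the distinct observed event times.
    t = np.asarray(times, dtype=float)
    d = np.asarray(events, dtype=int)
    uniq = np.unique(t[d == 1])
    surv, s = [], 1.0
    for u in uniq:
        n_at_risk = np.sum(t >= u)            # still at risk just before u
        d_u = np.sum((t == u) & (d == 1))     # events at u
        s *= 1.0 - d_u / n_at_risk
        surv.append(s)
    return uniq, np.array(surv)

def logrank_test(t1, e1, t2, e2):
    # Two-sample log-rank chi-square statistic and p-value.
    t = np.concatenate([t1, t2]).astype(float)
    e = np.concatenate([e1, e2]).astype(int)
    g = np.concatenate([np.zeros(len(t1)), np.ones(len(t2))])
    O = E = Vr = 0.0
    for u in np.unique(t[e == 1]):
        at_risk = t >= u
        n = at_risk.sum()
        n1 = (at_risk & (g == 0)).sum()
        d_u = ((t == u) & (e == 1)).sum()
        d1 = ((t == u) & (e == 1) & (g == 0)).sum()
        O += d1                                # observed events in group 1
        E += d_u * n1 / n                      # expected under equal hazards
        if n > 1:
            Vr += d_u * (n1 / n) * (1 - n1 / n) * (n - d_u) / (n - 1)
    stat = (O - E) ** 2 / Vr
    return stat, chi2.sf(stat, df=1)

# Two exponential samples with different scales should be separated.
rng = np.random.default_rng(1)
t1, e1 = rng.exponential(1.0, 100), np.ones(100, dtype=int)
t2, e2 = rng.exponential(3.0, 100), np.ones(100, dtype=int)
stat, p = logrank_test(t1, e1, t2, e2)
times, surv = km_estimate(t1, e1)
```

Libraries such as lifelines provide equivalent, production-grade versions of both routines.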
The n-by-p matrix V is the prepared transformed logical dataset according to Section 3.1, where n is the number of observations, p is the number of variables, and k is the estimated subset size to select for the calculation parts of the algorithms.
Recalling V, which is constructed from the observation vectors corresponding to each of the variables, D = [d_kp], a k-by-p matrix, is a selected subset of V, where k ≤ n is defined as the number of observations in any subset of V. For any variable i, we define the vector O_i as a time-to-event vector which includes the failure times of any observation j for which the value v_ij is one. Similarly, we define the vector Z_i, including the failure times of any observation j for which the value v_ij is zero. The vectors R and S are defined as follows:
r_i = \begin{cases} t_i, & \sum_j d_{ij} > 0 \\ 0, & \text{otherwise} \end{cases} \qquad i = 1 \dots n \qquad (3.5)

s_i = \begin{cases} t_i, & \prod_j d_{ij} = 1 \\ 0, & \text{otherwise} \end{cases} \qquad i = 1 \dots n \qquad (3.6)
The vector R is constructed from all non-zero entries r_i, and similarly the vector S is constructed from all non-zero entries s_i.
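A minimal NumPy sketch of the R and S constructions of equations (3.5) and (3.6), reading the row-sum condition as "at least one selected variable equals one" (the non-trivial reading for a 0/1 matrix); the function name is illustrative:

```python
import numpy as np

def r_s_vectors(D, t):
    # D: (n, k) 0/1 submatrix of V for the selected variables; t: (n,) event times.
    D, t = np.asarray(D), np.asarray(t)
    R = t[D.sum(axis=1) > 0]    # at least one selected variable equals 1   (3.5)
    S = t[D.prod(axis=1) == 1]  # all selected variables equal 1            (3.6)
    return R, S

D = np.array([[1, 0], [0, 0], [1, 1]])
t = np.array([10.0, 20.0, 30.0])
R, S = r_s_vectors(D, t)   # only the observations matching each condition survive
```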
A class of methods and algorithms to select inefficient variables follows.
3.3.1 Singular Variable Effect

The objective of the Singular Variable Effect (SVE) method is to determine the efficiency of a variable by analyzing the effect of the presence of a variable, singularly, in comparison with its absence in a transformed logical dataset. For p variables, we aim to construct a vector ∆ = [δ_i], i = 1 . . . p, to rank the efficiency of the variables. The preliminary step for the highest efficiency in this method is to initially cluster the variables based on the correlation coefficient matrix of the original dataset, M, choose a representative variable from each highly correlated cluster, and eliminate the other variables from the dataset. For instance, for any given dataset, if three variables are highly correlated, only one of them is selected randomly and the other two are eliminated from the dataset. This process assures that the remaining variables, to which the methods and heuristic algorithms are applied, are not highly correlated.
As an outcome of the SVE procedure, one who hopes to reduce the number of variables in the dataset for further analysis can eliminate the identified less efficient variables; one who aims to concentrate on a reduced number of variables can, likewise, choose a category of the identified more efficient variables.
The heuristic algorithm for the SVE method is:

for i = 1 to p do
    Calculate O_i and Z_i for the observation vector of variable i in dataset V
    Compare T and O_i with the Wilcoxon rank-sum test
    Save the test score for variable i as α_i
    Compare T and Z_i with the Wilcoxon rank-sum test
    Save the test score for variable i as β_i
    Calculate δ_i = α_i − β_i
end for
Return ∆ = [δ_p] as the variable efficiency vector
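The SVE loop can be sketched as follows, using SciPy's Wilcoxon rank-sum test for the test scores and reading the second comparison as T versus Z_i (the printed algorithm repeats O_i, evidently a typo, since α_i and β_i would otherwise coincide); names are ours:

```python
import numpy as np
from scipy.stats import ranksums

def sve_scores(V, t):
    # delta_i = alpha_i - beta_i, where alpha_i compares T with O_i (v_ij = 1)
    # and beta_i compares T with Z_i (v_ij = 0), both via Wilcoxon rank-sum.
    n, p = V.shape
    delta = np.zeros(p)
    for i in range(p):
        O = t[V[:, i] == 1]
        Z = t[V[:, i] == 0]
        alpha = ranksums(t, O).pvalue
        beta = ranksums(t, Z).pvalue
        delta[i] = alpha - beta
    return delta

# Toy run on random binary covariates with exponential event times.
rng = np.random.default_rng(2)
t = rng.exponential(1.0, 60)
V = rng.integers(0, 2, size=(60, 4))
delta = sve_scores(V, t)
```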
3.3.2 Nonparametric Test Score
The log-rank test score variable selection method is applied to select subsets of the best and the worst variables in terms of efficiency. The Nonparametric Test Score (NTS) method is a variable clustering technique which selects sets of k variables from the transformed logical dataset V and calculates the score of each variable at two levels. The first level determines the priority of variable efficiency via the score reached by the frequency of the presence of each variable in the rejected subsets, in comparison with the original time-to-event vector T. We code this level of calculation with the letter F. The second level rates the variables by the cumulative score of each variable from the comparisons of all selected subsets' nonparametric test results with the original time-to-event vector T. This level acts as a search procedure to detect the less efficient variables and is denoted by the letter C. The Randomization (RN) algorithm randomly chooses l subsets of size k from V, the transformed logical dataset of p variables. We define a randomization dataset matrix Ψ = [ψ_lk], where each row is formed by the k variable identification numbers of a selected subset, for l subsets overall. The heuristic algorithm of NTS method level F is:
for i = 1 to l do
    Compose the dataset D_i for variable set i in Ψ, including variables ψ_ij where j = 1 to k
    Calculate R_i over the dataset D_i
    Compare T and R_i with the log-rank test
    Save the test score for the variables in subset i as ψ_i(k+1)
end for
Eliminate the rows of Ψ whose array k+1 has a value of more than 0.05, and call this new set Ψ*
Count the contribution (presence) of each variable h, based on its identification number, in the first k columns of Ψ* and save it as γ_h
Return Γ = [γ_p] as the variable efficiency vector
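A sketch of NTS level F, substituting SciPy's Wilcoxon rank-sum test for the log-rank comparison (SciPy ships no log-rank test; the counting logic is unchanged); the function signature and seed handling are our assumptions:

```python
import numpy as np
from scipy.stats import ranksums  # stand-in for the log-rank comparison

def nts_f(V, t, k, l, alpha=0.05, seed=None):
    # Score = number of times a variable appears in a subset whose comparison
    # of T with the subset's R vector rejects at level alpha (the rows of Psi*).
    rng = np.random.default_rng(seed)
    p = V.shape[1]
    gamma = np.zeros(p)
    for _ in range(l):
        subset = rng.choice(p, size=k, replace=False)
        R = t[V[:, subset].sum(axis=1) > 0]      # R vector of the subset
        if len(R) < 2:
            continue
        if ranksums(t, R).pvalue <= alpha:       # rejected subset
            gamma[subset] += 1
    return gamma

rng = np.random.default_rng(3)
t = rng.exponential(1.0, 60)
V = rng.integers(0, 2, size=(60, 6))
gamma = nts_f(V, t, k=2, l=50, seed=3)
```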
The heuristic algorithm of NTS method level C is:
for i = 1 to l do
    Compose the dataset D_i for variable set i in Ψ, including variables ψ_ij where j = 1 to k
    Calculate R_i over the dataset D_i
    Compare T and R_i with the log-rank test
    Save the test score for the variables in subset i as ψ_i(k+1)
end for
Assume Ω = [ω_p] is the reverse variable efficiency vector, where initially each array, as the cumulative contribution score corresponding to a variable, is zero
for i = 1 to l do
    for j = 1 to k do
        Add the value of ψ_i(k+1) to the cumulative contribution score ω of the variable whose identification number = ψ_ij
    end for
end for
Return Ω = [ω_p] as the variable inefficiency vector
3.3.3 Splitting Semi-Greedy Clustering
The Splitting Semi-Greedy (SSG) method is proposed to select an inefficient variable subset. A clustering procedure is incorporated through a randomly splitting approach to select the best local subset according to a defined criterion. In this method we use block randomization, which is designed to randomize subjects into groups of equal sample size. A nonparametric test is used to test the null hypothesis that two samples are drawn from the same distribution, against a given alternative hypothesis; the Wilcoxon rank-sum test is used in this method.

The concept of this method is inspired by the semi-greedy heuristic [14, 20] and tabu search [17]. The criterion of this search is similar to that of the Nonparametric Test Score (NTS) method [48], which is to collect the most inefficient variable subset via the Wilcoxon rank-sum test score. At each of l trials, all p variables from the transformed logical dataset V are randomly clustered into subsets of k variables, where one cluster possibly contains fewer than k variables and the number of clusters is equal to ⌈p/k⌉. To calculate the score summation for each variable over all trials, a randomization dataset matrix Ξ = [ξ_lk] is set, where each row is formed by the k variable identification numbers of a selected subset, for all l trials. Comprehensive experimental results for validation of the proposed methods by comparison with similar methods are presented in Sections 3.4 and 3.5. The heuristic algorithm for the SSG method is:
for i = 1 to l do
    Split the data into equally sized subsets
    Compose the dataset D for each subset
    Calculate R over D for each subset
    Compare T and R with the Wilcoxon rank-sum test and save the test score for each subset, one by one
    Select the subset with the highest test score
    Save the test score for the variables in the selected subset as ξ_i(k+1)
end for
Assume Θ = [θ_p] is the reverse variable efficiency vector, where initially each array, as the cumulative contribution score corresponding to a variable, is zero
for i = 1 to l do
    for j = 1 to k do
        Add the value of ξ_i(k+1) to the cumulative contribution score θ of the variable whose identification number = ξ_ij
    end for
end for
Return Θ = [θ_p] as the variable inefficiency vector
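The SSG trials can be sketched as below, again with the Wilcoxon rank-sum p-value standing in as the "test score" (a large p-value means the subset fails to separate the survival distributions, i.e. it is inefficient); the function name and seed argument are ours:

```python
import numpy as np
from scipy.stats import ranksums

def ssg_scores(V, t, k, l, seed=None):
    # Each trial: random split of the p variables into ceil(p/k) clusters; the
    # cluster with the highest score accumulates that score for its variables.
    rng = np.random.default_rng(seed)
    p = V.shape[1]
    theta = np.zeros(p)
    for _ in range(l):
        perm = rng.permutation(p)
        best, best_score = None, -np.inf
        for start in range(0, p, k):                 # equal-width clusters
            c = perm[start:start + k]
            R = t[V[:, c].sum(axis=1) > 0]
            if len(R) < 2:
                continue
            score = ranksums(t, R).pvalue            # "test score" of the cluster
            if score > best_score:
                best, best_score = c, score
        if best is not None:
            theta[best] += best_score                # cumulative inefficiency
    return theta

rng = np.random.default_rng(4)
t = rng.exponential(1.0, 60)
V = rng.integers(0, 2, size=(60, 6))
theta = ssg_scores(V, t, k=2, l=30, seed=4)
```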
3.3.4 Weighted Time Score
The Weighted Time Score (WTS) method is a variable clustering technique which selects sets of k variables from the transformed logical dataset V and calculates the score of each variable. The first step is to determine the observations in a selected subset for which all k variables are 1, and eliminate the other observations from the subset. The cumulative time score over the vector T credits each of the variables in the subset. The final score of each variable is reached by aggregating those credits over l trials. The randomization algorithm randomly chooses l subsets of size k from V, the transformed logical dataset of p variables. We define a randomization dataset matrix Ψ = [ψ_lk], where each row is formed by the k variable identification numbers of a selected subset, for l subsets overall. The heuristic algorithm for the WTS method is:
for i = 1 to l do
    Compose the dataset D_i for variable set i in Ψ, including variables ψ_ij where j = 1 to k
    Calculate S_i over the dataset D_i
    Calculate Σ t_i over S_i as a time score
    Save the time score for the variables in subset i as ψ_i(k+1)
end for
Assume Ω = [ω_p] is the reverse variable efficiency vector, where initially each array, as the cumulative contribution score corresponding to a variable, is zero
for i = 1 to l do
    for j = 1 to k do
        Add the value of ψ_i(k+1) to the cumulative contribution score ω of the variable whose identification number = ψ_ij
    end for
end for
Return Ω = [ω_p] as the variable inefficiency vector
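The WTS scoring collapses to a few lines once the S vector of equation (3.6) is available; this sketch uses our own names and a seeded generator:

```python
import numpy as np

def wts_scores(V, t, k, l, seed=None):
    # Each trial: random subset of k variables; observations with all k
    # variables equal to 1 (the S vector) contribute their summed event
    # times as the subset's time score, credited to every subset variable.
    rng = np.random.default_rng(seed)
    p = V.shape[1]
    omega = np.zeros(p)
    for _ in range(l):
        subset = rng.choice(p, size=k, replace=False)
        S = t[V[:, subset].prod(axis=1) == 1]
        omega[subset] += S.sum()
    return omega

rng = np.random.default_rng(5)
t = rng.exponential(1.0, 60)
V = rng.integers(0, 2, size=(60, 6))
omega = wts_scores(V, t, k=2, l=40, seed=5)
```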
Comprehensive experimental results for validation of the proposed methods, by comparison with similar methods for a right-censored dataset (Section 3.4) and an uncensored dataset (Section 3.5), are presented next.

3.4 Experimental Results and Analysis for Right-Censored Data
To evaluate the performance of the proposed methods for right-censored data, the NTS and SSG methods and algorithms are investigated under different types of datasets, including collected and simulated data. In order to obtain an estimate of the desired number of variables in any selected subset of these methods, for the calculations in the hybrid algorithms, we use the principal component analysis (PCA) scree plot criterion [63]. The criterion for this evaluation is based on the eigenvalues of the components of a dataset. The cut-off in the scree plot is interpreted as the set of significant eigenvalues among all components, which determines k, the number of efficient variables in a given dataset. We apply this technique to the original dataset to determine k.
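The scree criterion can be sketched as an eigenvalue cut-off; the 5% relative-drop threshold below is our heuristic stand-in for reading the scree-plot elbow by eye, not part of the original criterion:

```python
import numpy as np

def scree_k(U, drop=0.05):
    # Count leading components until an eigenvalue falls below `drop` times
    # its predecessor -- a heuristic reading of the scree-plot cut-off.
    C = np.cov(np.asarray(U, dtype=float), rowvar=False)
    ev = np.sort(np.linalg.eigvalsh(C))[::-1]   # eigenvalues, descending
    k = 1
    while k < len(ev) and ev[k] >= drop * ev[k - 1]:
        k += 1
    return k, ev

# Toy dataset with exactly three strong components embedded in 10 dimensions.
rng = np.random.default_rng(6)
L = rng.normal(size=(3, 10))
z = rng.normal(size=(500, 3)) * np.array([10.0, 9.0, 8.0])
U = z @ L + 0.01 * rng.normal(size=(500, 10))
k, ev = scree_k(U)
```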
First, the well-known primary biliary cirrhosis (PBC) dataset (Fleming and Harrington 1991) is considered as the sample collected dataset. These data are from a double-blinded randomized trial including 312 observations. This right-censored dataset contains 17 variables in addition to the censoring information, identification number, and event time of each observation. Experimental results are presented for the 276 observations with complete records (36 of the 312 observations are incomplete). For the original PBC dataset, the approximate value of k is 3. This value is the subset size in the algorithms applied to the transformed logical dataset, the n-by-p matrix V; the scree plot is shown in Fig. 3.3.
Figure 3.3: Scree plot (component number versus eigenvalue) of the original PBC dataset including 17 variables and 276 observations.

Another setup step, before applying the methods and algorithms, is to preliminarily determine which variables behave most similarly, using a modified k-means clustering algorithm. Basically, the k-means algorithm partitions the points of the n-by-p dataset matrix into k clusters; the default distance is squared Euclidean. Our modified k-means algorithm changes the dimension of the transformed logical dataset and returns clusters of the variables in terms of the observations. The result gives an acceptable view of the expected clustering in the next parts. The number of clusters is calculated as the nearest integer to the quotient of the number of dataset variables p divided by the subset size k. The result of this preliminary evaluation for the transformed logical uncensored PBC dataset is shown in Table 3.2.
Table 3.2: Modified k-means clustering algorithm result for the transformed logical uncensored PBC dataset. The number of clusters is the nearest integer to p/k.

Cluster #   Variable #
1           4, 6, 7
2           3, 5, 10, 17
3           1
4           2
5           8, 9, 11, 12, 13, 14, 15, 16

To verify the performance of the proposed methods, the results of the methods and algorithms for the transformed logical PBC dataset are compared with the Random Survival Forest (RSF) method results [27, 28], the Additive Risk Model (ADD) [36], and the Weighted Least Square (LS) [24] results for variable selection in the PBC dataset, given in Table 3.3. A comprehensive comparison of RSF, ADD and LS performance with other relevant methods in high-dimensional time-to-event data analysis, such as Cox's Proportional Hazards Model, LASSO and PCR, has been presented in [24, 28, 36].
The survival function results of the NTS-RN algorithm for 100 trials on the transformed logical uncensored PBC dataset are shown in Fig. 3.4.

Figure 3.4: Empirical transformed logical uncensored PBC dataset survival function (solid red) with 99% confidence bounds (red dotted). Subsets of k variables with failed-to-reject Wilcoxon rank-sum test results (gray), and subsets of k variables with rejected Wilcoxon rank-sum test results (black).

Each number in Tables 3.3 and 3.4 represents a specific variable in the experiment dataset. For example, in Table 3.3, variable #15 is among the less efficient variables selected by the NTS (F), SSG, RSF and LS methods.

From the results shown in Table 3.3, the NTS (F) method has the closest performance to SSG. An advantage of these methods is that more than 80% of the inefficient variables detected by the other methods (RSF, LS and ADD) are collected by the proposed algorithms in a significantly shorter calculation period. The robustness of this class of methods has been examined for several sample collected time-to-event datasets with high-dimensional covariates in this study.
Table 3.3: Selected less efficient variables in all proposed methods and comparison to RSF, ADD, and LS method results.

Method     Selected Less Efficient Variables
NTS (F)    1, 3, 5, 6, 10, 15, 17
NTS (C)    2, 5, 10, 11, 13, 17
SSG        1, 3, 5, 10, 15, 17
RSF        1, 3, 5, 12, 13, 14, 15, 17
ADD        1, 2, 5, 12, 14
LS         1, 2, 3, 14, 15, 17

Continuing the validation of the proposed methods, we set n = 400 observations and p = 25 variables and simulated event times from a pseudorandom algorithm. In addition to constant and periodic binary numbers, normally and exponentially distributed pseudorandom numbers are generated as independent values of the explanatory variables. The censored observations were also generated independently of the events. Additionally, some variables are intentionally set as linear functions of the event time. We present the results of the methods and algorithms applied to the simulated data in Table 3.4. These results are compared with the simulation-defined pattern, and the comparison verifies the performance of all proposed methods and algorithms. In the simulation, the defined pattern includes five inefficient variables: No. 5, 10, 15, 20, and 25.

Table 3.4: Selected less efficient variables in all proposed methods and comparison to the simulation-defined pattern.

Method       Selected Less Efficient Variables
NTS (F)      5, 10, 25
NTS (C)      5, 10, 25
SSG          5, 10, 25
Definition   5, 10, 15, 20, 25
The inefficiency analysis results for the simulation experiment show that the variables with identification numbers 5, 10 and 25 are detected as less efficient among all 25 simulated variables by the NTS (F), NTS (C) and SSG methods. In this example, one could eliminate these identified less efficient variables if the objective is to reduce the number of variables in the dataset for further analysis.
In addition, a set of simulation numerical experiments is presented to evaluate the performance of the proposed methods, as shown in Table 3.5. For this, six different simulation models are defined. The settings of these models are as explained above, for n = 400 observations and p = 25 variables, following the same data-generating algorithm. The numbers of significant inefficient variables in the simulation models are, in order, m = 4, 4, 5, 5, 6, 6, as presented in the Definition row of Table 3.5. The simulations are repeated 100 times independently. For each method, m represents the integer average number of determined and selected inefficient variables and p denotes the performance of the method over the 100 trials, where

p = m_{\text{method}} / m_{\text{definition}} \qquad (3.7)

and \bar{p} is the average performance of each method. The adjusted censoring rates for all simulation numerical experiments are 50 ± 5%.
Table 3.5: Performance of the proposed methods based on six simulation numerical experiments. m is the integer average number of selected inefficient variables and p is the performance of the method based on 100 replications.

Method       #1        #2        #3        #4        #5        #6        Ave
             m    p    m    p    m    p    m    p    m    p    m    p    p
NTS (F)      3  0.75   3  0.75   3  0.60   4  0.80   4  0.67   5  0.83   0.73
NTS (C)      3  0.75   3  0.75   4  0.80   4  0.80   5  0.80   5  0.83   0.79
SSG          2  0.50   3  0.75   3  0.60   4  0.80   4  0.67   4  0.67   0.66
Definition   4         4         5         5         6         6

As a graphical tool for interpreting the results of the proposed methods, a hybrid scatter plot of variable inefficiency for the three criteria, (a) NTS (F) score, (b) NTS (C) score and (c) SSG score, for the PBC data experiment results is displayed in Fig. 3.5. Each variable is depicted by a circle icon, where the two axes represent the standardized criteria (a) and (b) and the diameter of each circle represents the standardized criterion (c). These inefficiency score criteria are standardized to the range [0, 25]; for clear graphical presentation, the plot range is [−5, 25]. Therefore, it is interpreted from the hybrid scatter plot that a variable with a larger diameter and more distance from the center has less efficiency and is an ideal candidate to remove from the dataset, if desired.
Figure 3.5: Hybrid scatter plot (criterion a versus criterion b); upper-right variables have less efficiency.

3.5 Experimental Results and Analysis for Uncensored Data

To evaluate the performance of the designed methods for uncensored data, the PBC dataset is first considered as the sample collected dataset. The purpose of this section is to examine and validate the proposed methods of this chapter with uncensored time-to-event data. This dataset includes 111 uncensored complete observations and 17 explanatory variables. To verify the performance of the proposed methods for uncensored data, the results of these methods and algorithms for the transformed logical uncensored PBC dataset are compared with the results of the Nonparametric Test Score (NTS) method [48], the Random Survival Forest (RSF) method [27, 28], the Additive Risk Model (ADD) [36], and the Weighted Least Square (LS) method [24] for variable selection on a similar dataset, given in Table 3.6.
Each number in Tables 3.6 and 3.7 represents a specific variable in the experiment dataset. For example, in Table 3.6, variable #1 is determined from the results as an inefficient variable by all methods.

Table 3.6: Selected inefficient variables in all proposed methods and comparison to NTS, RSF, ADD, and LS method results.

Method    Selected Inefficient Variables
SVE       1, 3, 5, 10, 13, 17
SSG       1, 3, 5, 10, 15, 17
WTS       1, 3, 5, 10, 15, 17
NTS       1, 3, 5, 6, 10, 15, 17
RSF       1, 3, 5, 12, 13, 14, 15, 17
ADD       1, 2, 5, 12, 14
LS        1, 2, 3, 14, 15, 17
From the results shown in Table 3.6, the SSG and WTS methods have the same performance. More than 80% of the inefficient variables detected by the other methods (NTS, RSF, LS and ADD) are collected by the proposed algorithms in a significantly shorter calculation period; the robustness of this class of methods has been examined for several sample datasets.

To show variable inefficiency through the three designed methods SVE, SSG, and WTS, a graphical representation of the experiment results for the uncensored PBC dataset is depicted in Fig. 3.6. A variable with a larger radius and more distance from the center is less efficient and an ideal candidate to remove from the dataset, if desired.

Figure 3.6: Radar plot of inefficient variables: Normalized inefficiency results from the transformed logical uncensored PBC dataset by the SVE algorithm (red), SSG algorithm (green), and WTS algorithm (yellow).
A simulation is designed to examine and validate the proposed methods for uncensored data. We set n = 400 observations and p = 15 variables and simulated event times from a pseudorandom algorithm. We also set the first five variables as inefficient, where the first two are absolutely inefficient. Some variable vectors are set as linear functions of the event time data, in addition to constant and periodic binary numbers as well as normally and exponentially distributed pseudorandom numbers as independent values of the explanatory variables. The results of the methods and algorithms applied to the simulated data are presented in Table 3.7. These results are compared with the results of the NTS method and with the simulation-defined pattern shown in Table 3.7; the comparison verifies the performance of all proposed methods.

Table 3.7: Selected inefficient variables in all proposed methods and comparison to NTS results and the simulation-defined pattern.

Method       Selected Inefficient Variables (No.)
SVE          1, 2, 3, 10
SSG          1, 2, 3
WTS          1, 2, 3, 5, 12
NTS          1, 2, 5
Definition   1, 2, 3, 4, 5
The inefficiency analysis results for the simulation experiment show that the variables with identification numbers 1, 2 and 3 are detected as inefficient by all proposed methods. To reduce the number of variables in the dataset for further analysis, these explanatory variables are the best candidates to be eliminated from the dataset.
Chapter 4

Methodology II: Heuristic Randomized Decision-Making Methods through Weighted Accelerated Failure Time Model

4.1 Overview

This chapter presents an analytical model for normalizing the explanatory variable dataset to reach an appropriate representative dataset. Next, heuristic randomized decision-making methods and algorithms, designed to select the most significant variables in terms of efficiency through the accelerated failure time model, are introduced. Then a numerical simulation experiment is developed to investigate the performance and validation of the proposed methods [12, 49].
4.2 Analytical Model II

A simple analytic model is developed to analyze complicated right-censored time-to-event data with high-dimensional explanatory variables, in order to facilitate the variable selection step within the decision-making procedure. Transforming data values into an appropriate computational form is a commonly used approach. There are many advantages of using discrete values over continuous ones, as discrete variables are easier to understand and utilize, more compact, and more accurate. In this study, for a time-to-event dataset including discrete values, a normalization analytical model is defined.

As explained in Chapter 2, the random variables Y and C represent the time-to-event and censoring time, respectively. Time-to-event data are represented by (T, ∆, U), where T_i = min(Y_i, C_i) and the censoring indicator d_i = 1 if the event occurs, otherwise d_i = 0. The observed covariate U represents a set of variables. Let any observation be denoted by (t_i, d_i, u_ij), i = 1 . . . n, j = 1 . . . p. It is assumed that the hazard at time t only depends on the survivors at time t, which assures the independent right-censoring assumption [31].
Data normalization is a method used to scale data and standardize the range of independent variables or features, to allow the comparison of corresponding normalized data; it is also known as feature scaling. As a simple and well-known rescaling method, unity-based normalization scales to the [0, 1] range. One problem in this process is the appearance of outliers that take extreme values; a solution is to remove the outliers before normalization using a threshold, as in the following algorithm:
1. Arrange the data in order
2. Calculate the first quartile Q1, the third quartile Q3, and the interquartile range IQR = Q3 − Q1
3. Compute the range [Q1 − 1.5(IQR), Q3 + 1.5(IQR)]
4. Remove any value outside this range as an outlier
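The four steps above translate directly into NumPy; a minimal sketch with an illustrative function name:

```python
import numpy as np

def remove_outliers_iqr(x):
    x = np.sort(np.asarray(x, dtype=float))      # step 1: arrange in order
    q1, q3 = np.percentile(x, [25, 75])          # step 2: quartiles
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr      # step 3: acceptance range
    return x[(x >= lo) & (x <= hi)]              # step 4: drop outliers

data = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 100.0])
clean = remove_outliers_iqr(data)                # the extreme value 100.0 is dropped
```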
The original dataset may include any type of explanatory data: binary, continuous, categorical, or ordinal. For any n-by-p dataset matrix U = [u_ij], there are n independent observations and p variables. To normalize the dataset, we define the minimum and maximum values of each explanatory variable vector U_j, for j = 1 . . . p, as:

u_j^{\min} = \min_i u_{ij}, \qquad u_j^{\max} = \max_i u_{ij}, \qquad i = 1 \dots n \qquad (4.1)

For any array u_ij, we allocate a normalized substituting array v_ij as:

v_{ij} = \frac{u_{ij} - u_j^{\min}}{u_j^{\max} - u_j^{\min}}, \qquad i = 1 \dots n,\; j = 1 \dots p \qquad (4.2)
The result of the transformation is an n-by-p dataset matrix V = [v_ij], which will be used in the designed heuristic methods and algorithms. We also define the time-to-event vector T = [t_i], including all observed event times. Application of the robustness validation method proposed in [12, 46–48] indicates that the change of correlation between variables before and after the transformation is not significant.
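Equations (4.1) and (4.2) amount to column-wise min-max scaling; a minimal sketch:

```python
import numpy as np

def unity_normalize(U):
    # Equations (4.1)-(4.2): scale each column of the n-by-p dataset to [0, 1].
    U = np.asarray(U, dtype=float)
    umin, umax = U.min(axis=0), U.max(axis=0)
    return (U - umin) / (umax - umin)

U = np.array([[2.0, 10.0],
              [4.0, 30.0],
              [6.0, 20.0]])
V = unity_normalize(U)   # each column now spans exactly [0, 1]
```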
As a brute-force search outcome, the all-subset-models method is the simplest and the most computationally consuming: it generates all possible subsets, including every combination of the p explanatory variables, 2^p − 1 subsets in total. This exhaustive method is not suitable for large numbers of variables. Another approach is to develop models including only combinations of k variables, called selective, where 1 ≤ k < p. Consequently, the total number of models is given by:

\binom{p}{k} = \frac{p!}{k!\,(p-k)!} \ll 2^p - 1 \qquad (4.3)
For instance, considering a problem where p = 20 and k = 3, the total numbers of subsets in the exhaustive and selective models are 1,048,575 and 1,140, respectively. Therefore, finding a reasonable k is beneficial [38]. For a comprehensive discussion of evaluating the desired number of variables in any selected subset, k, see [48]. In any given dataset, before transformation, the suggested criterion for this evaluation is based on the eigenvalues of the components determined by the principal component analysis (PCA) method. The evaluated k is applied in the heuristic randomized methods and algorithms, presented next, to select the most significant efficient variables in order to regulate the decision-making process. The following methods utilize random subset selection concepts to evaluate a coefficient score for each variable through the AFT model by applying the weighted least squares technique.
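The subset counts can be checked directly (the exhaustive figure 1,048,575 equals 2^20 − 1, so p = 20, k = 3 is assumed here):

```python
import math

p, k = 20, 3
exhaustive = 2 ** p - 1        # all non-empty subsets of p variables
selective = math.comb(p, k)    # only the subsets of exactly k variables
```

The three-orders-of-magnitude gap between the two counts is what makes the selective, randomized approach tractable.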
4.3 Methods and Algorithms for Variable Classification
The preliminary step for the highest efficiency in the proposed methods is to cluster the variables based on the correlation coefficient matrix of the original dataset, M = [m_ij], choose a representative variable from each highly correlated cluster, and then eliminate the other variables from the dataset. Let m_ij denote the covariance of variables i and j. The Pearson product-moment correlation coefficient matrix is defined as

m_{ij} = \frac{1}{n-1} \sum_{k=1}^{n} (v_{ik} - \bar{v}_i)(v_{jk} - \bar{v}_j), \qquad i = 1 \dots p,\; j = 1 \dots p \qquad (4.4)

where v_{ik} and \bar{v}_i represent, in order, the value of variable i in observation k and the mean of variable i; the second parenthesis is defined similarly for variable j. For instance, for any given dataset, if three variables are highly correlated in M, only one of them is selected randomly and the other two are eliminated from the dataset. The outcome of this process assures that the remaining variables, to which the methods are applied, are not highly correlated.
A method is developed, for the proposed normalization model, to select efficient variables in right-censored time-to-event datasets with high-dimensional explanatory variables. The basic assumption in designing appropriate methods for this purpose is that a variable which is completely inefficient on its own can provide a significant performance improvement when engaged with others, and that two variables which are inefficient by themselves can be efficient together [44]. In these methods we use block randomization, which is designed to randomize subjects into groups of equal sample size. Generally, randomization is the process of selecting or assigning subjects with the same chance. Many randomization procedures and methods have been proposed for random assignment, such as simple, replacement, block, and stratified randomization [34].
To obtain the variable coefficient estimates, let w_in be the KM weights. For i = 1 . . . n these weights can be calculated using the expression

w_{in} = \hat{S}(\ln T_{i-1}) - \hat{S}(\ln T_i) \qquad (4.5)

Considering the definition of the KM estimator \hat{S} in equation (2.11), w_{in} is obtained [32] by

w_{1n} = \frac{d_1}{n}, \qquad w_{in} = \frac{d_i}{n-i+1} \prod_{j=1}^{i-1} \left(\frac{n-j}{n-j+1}\right)^{d_j}, \qquad i = 2 \dots n \qquad (4.6)
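Equation (4.6) translated into code; a sketch assuming the observations are already sorted by increasing event time (the function name is ours):

```python
import numpy as np

def km_weights(d):
    # Equation (4.6); d[i] is the censoring indicator of the i-th smallest
    # event time (1 = observed event, 0 = censored).  Indices here are 0-based.
    d = np.asarray(d, dtype=float)
    n = len(d)
    w = np.zeros(n)
    prod = 1.0
    for i in range(n):
        w[i] = d[i] / (n - i) * prod              # d_i / (n - i + 1) in 1-based terms
        prod *= ((n - i - 1) / (n - i)) ** d[i]   # factor ((n - j)/(n - j + 1))^{d_j}
    return w

w_all = km_weights(np.ones(5))              # no censoring: equal weights 1/n
w_cen = km_weights(np.array([1, 0, 1, 1]))  # a censored observation gets weight 0
```

With no censoring the weights reduce to the ordinary least-squares weights 1/n, which is a useful sanity check.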
Recalling the AFT model (2.17), the estimator of β is

\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} w_{in} \left[\ln T_i - \beta X_i\right]^2 \qquad (4.7)

The matrix calculation leads to the estimator solution

\hat{\beta} = (X' W X)^{-1} X' W \ln T \qquad (4.8)

where X' is the transpose of X. Equation (4.8) is used to estimate the variable coefficients in the heuristic decision-making methods, where X ≡ V.
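Equation (4.8) is a single weighted least-squares solve; a sketch with illustrative names, checked on uncensored simulated data where the KM weights reduce to 1/n:

```python
import numpy as np

def aft_wls(X, T, w):
    # Equation (4.8): beta = (X' W X)^{-1} X' W ln T, with W = diag(KM weights).
    X = np.asarray(X, dtype=float)
    y = np.log(np.asarray(T, dtype=float))
    W = np.diag(np.asarray(w, dtype=float))
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# Uncensored sanity check: the estimator recovers the coefficients of a
# simulated log-linear AFT model ln T = X beta + noise.
rng = np.random.default_rng(7)
X = rng.normal(size=(200, 3))
beta_true = np.array([0.5, -0.2, 0.1])
T = np.exp(X @ beta_true + 0.01 * rng.normal(size=200))
beta_hat = aft_wls(X, T, np.full(200, 1.0 / 200))
```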
Determine k, the number of observations in any subset of the transformed normalized dataset V, as described in Section 4.2. With k ≤ n, the k-by-p sub-dataset matrix D = [d_kp] is defined as a selected subset of V. Also, the vector B = [b_k] will be utilized in the following methods as the coefficient estimator vector corresponding to the explanatory variables of the sub-dataset D.
In what follows, this chapter provides heuristic randomized decision-making
algorithms that implement variable selection through a variable efficiency
recognition procedure based on a defined coefficient score in right-censored
high-dimensional time-to-event data. The proposed methods use the random
subset selection concept to evaluate the coefficient score of each variable
through the accelerated failure time model by applying the weighted least
squares technique.
4.3.1 Ranking Classification
The concept of the Ranking Classification (RC) method is inspired by the semi-
greedy heuristic [14, 20]. The objective of this method is to identify efficient
variables, in terms of correlation with survival time, through a random
splitting approach that selects the best local subset using a coefficient
score. The method is a ranking classification that uses a semi-greedy
randomized procedure to split all explanatory variables into subsets of
predefined size k [46, 47] using the equal-width binning technique. The
number of bins is

$$c = \lceil p/k \rceil \qquad (4.9)$$

Note that if p is not divisible by k, then one bin contains fewer than k variables.
At each of l trials, where l is a defined, reasonably large number, a local
coefficient score is calculated for the variable cluster in each bin. For the i'th
cluster, this coefficient score is given by

$$\beta_i = \left(\sum_{j=1}^{k} (\beta_j)^2\right)^{1/2} \qquad (4.10)$$

where $\beta_j$ is obtained by weighted least squares estimation using (4.8), with $X \equiv D$.
As a semi-greedy approach, the largest calculated βi in each trial identifies
the selected cluster of k variables, and the coefficients of these variables are
added to a cumulative coefficient score for each of the cluster's variables.
To compute this score summation for each variable over all trials, a
randomization dataset matrix Ξ = [ξck] is set up, where each row, as a bin,
is filled with the identification numbers of the k randomly selected variables
in each trial. Ranking classification is applied at the end of the l trials
based on the accumulated score of each variable in the vector Θ = [θp]. The
heuristic algorithm for the proposed method follows.
Let Θ = [θp] be the variable efficiency rank vector, where initially each
array element, the cumulative contribution score corresponding to a
variable, is zero
for i = 1 to l do
Randomly split the dataset V into equally sized c subsets
Compose Ξ by c subsets
for j = 1 to c do
Compose the sub-dataset Dj for variable subset j in Ξ, including
variables ξjq for q = 1 to k
Calculate Wj over the sub-dataset Dj
Calculate the coefficient vector Bj over the sub-dataset Dj using (4.8)
Calculate the coefficient score βj corresponding to Bj over the
sub-dataset Dj using (4.10)
end for
Select the subset with the highest βj; call this new subset D∗
for j = 1 to k do
Add the value of bij corresponding to the variable ξij to the cumulative
coefficient score θq of the variable q with identification number ξij
end for
end for
Normalize Θ = [θp]
Return Θ as the variable efficiency cumulative coefficient score for ranking
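The RC algorithm above can be sketched compactly. This is a hedged illustration, not the dissertation's implementation: `rc_rank` is a hypothetical name, and coefficient magnitudes are accumulated (an assumption, so that opposite signs do not cancel in the ranking).

```python
import numpy as np

def rc_rank(V, log_T, w, k, trials, rng):
    """Sketch of the RC algorithm: random equal-width bins of k variables,
    WLS coefficients per bin (eq. 4.8), L2 coefficient score per bin
    (eq. 4.10); the best bin's coefficient magnitudes accumulate into theta."""
    n, p = V.shape
    theta = np.zeros(p)
    W = np.diag(w)
    for _ in range(trials):
        perm = rng.permutation(p)                      # random split of variables
        bins = [perm[i:i + k] for i in range(0, p, k)]
        best_score, best_bin, best_b = -np.inf, None, None
        for idx in bins:
            D = V[:, idx]
            b = np.linalg.solve(D.T @ W @ D, D.T @ W @ log_T)
            score = np.sqrt(np.sum(b ** 2))            # eq. (4.10)
            if score > best_score:                     # semi-greedy selection
                best_score, best_bin, best_b = score, idx, b
        theta[best_bin] += np.abs(best_b)              # cumulative contribution
    return theta / theta.max()                         # normalized ranking score
```

On simulated data in which only a few variables drive the event time, the highest normalized scores concentrate on those variables.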
4.3.2 Randomized Resampling Regression
The Randomized Resampling Regression (3R) is a classification method inspired
by resampling methods. It randomly selects a subset of size k out of the p
variables from the transformed normalized dataset V and estimates the regression
coefficients of the subset variables using the weighted least squares estimation
described in Section 2.3. Repeating the procedure for a defined, reasonably
large number of trials l [48], the accumulation of each variable's coefficient
over all trials gives a final estimation score by which the variables are
ranked. The cumulative coefficient of an explanatory variable is an applicable
estimate over a high-dimensional time-to-event dataset. We define a
randomization dataset matrix F = [flk], where each row is formed by the
identification numbers of the k variables in a selected subset, over all l
subsets. The heuristic algorithm for the 3R method is:
Let Ω = [ωp] be the variable efficiency rank vector, where initially each
array element, the cumulative contribution score corresponding to a
variable, is zero
for i = 1 to l do
Compose the sub-dataset Di for variable subset i in F including
variables fij for j = 1 to k
Calculate Wi over the sub-dataset Di
Calculate the coefficient vector Bi over the sub-dataset Di using
equation (4.8)
for j = 1 to k do
Add the value of bij corresponding to the variable fij to the cumulative
coefficient score ωq of the variable q with identification number fij
end for
end for
Normalize Ω = [ωp]
Return Ω as the variable efficiency cumulative coefficient score for ranking
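The 3R algorithm admits a similarly short sketch. This is an illustration under stated assumptions: `rrr_rank` is a hypothetical name, subsets are drawn on the fly rather than read from a precomposed matrix F, and coefficient magnitudes are accumulated so signs do not cancel.

```python
import numpy as np

def rrr_rank(V, log_T, w, k, trials, rng):
    """Sketch of the 3R algorithm: in each of `trials` trials a random
    subset of k variables is drawn, its WLS coefficients (eq. 4.8) are
    computed, and their magnitudes accumulate into the score vector omega."""
    n, p = V.shape
    omega = np.zeros(p)
    W = np.diag(w)
    for _ in range(trials):
        idx = rng.choice(p, size=k, replace=False)     # one row of F
        D = V[:, idx]
        b = np.linalg.solve(D.T @ W @ D, D.T @ W @ log_T)
        omega[idx] += np.abs(b)                        # cumulative contribution
    return omega / omega.max()                         # normalized ranking score
```

Unlike RC, every sampled subset contributes to the score, so each variable's rank reflects its average coefficient magnitude across all subsets containing it.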
The simulation experiment, results, and analysis for the RC method are
presented in the next section.
4.4 Simulation Experiment, Results and Analysis for RC Method
This section presents verification of the RC method by simulation experiment,
together with the experimental results and analysis.
4.4.1 Simulation Design and Experiment
To evaluate the performance and investigate the accuracy of the proposed
methods, simulation experiments are developed. We set n = 1000 observations
and p = 25 variables. Four different simulation models are generated by
pseudorandom algorithms for $i = 1 \ldots p$ as

$$u_i^1 \sim U(-1, 1), \quad u_i^2 \sim N(0, 1), \quad u_i^3 \sim \mathrm{Exp}(1), \quad u_i^4 \sim \mathrm{Weib}(2, 1) \qquad (4.11)$$
by the uniform, normal, exponential, and Weibull distributions respectively,
where $U^j$, $j = 1 \ldots 4$, is the explanatory variable matrix containing the p
variable vectors representing the original dataset U for the j'th model. In
each simulation, the coefficient components are set as

$$\beta_{ii} = 10,\ \beta_{jj} = 1,\ \beta_{kk} = 0.1, \qquad ii = 1 \ldots 5,\ jj = 6 \ldots 15,\ kk = 16 \ldots 25 \qquad (4.12)$$
The simulated event time vector Y is obtained from (2.17), and the censoring
indicator is generated using a uniform distribution. To improve accuracy in
calculating the survival time vector t, an error vector E ∼ N(0, 0.1) is
included in each simulation. To obtain k, an estimate of the desired number
of variables in any selected subset, principal component analysis (PCA) is
used, which gives k = 3.
All four simulation models are applied in the proposed algorithm. A sample of
the numerical results of the simulation experiment, when the data are
generated by the second model U2 and applied in the proposed algorithm, is
depicted in Fig. 4.1. Variables with larger coefficient scores β are more
efficient, and variables with smaller scores are ideal candidates for
elimination from the dataset if desired. Each number represents the specific
identification number of a variable.

Figure 4.1: Normalized coefficient score results for the simulation dataset
for 100 trials. [Plot of normalized coefficient score (0 to 1) versus
variable number (1 to 25).]
4.4.2 Results and Analysis
A set of numerical simulation experiments is conducted to evaluate the
performance of the proposed method. The four simulation models defined in
(4.11) are examined, and the results are presented in Table 4.1. These
experiments are also designed for n = 1000 observations and p = 25 variables,
following the same data-generating algorithm. The number of significantly
efficient variables designed into each simulation model is set to m∗ = 5 for
each experiment. Each simulation is repeated 200 times independently. For
each method, m represents the average number of determined and selected
efficient variables as an output of the proposed heuristic algorithm, and p
denotes the performance of the method over the 200 trials, where

$$p = m_{\mathrm{method}}/m_{\mathrm{definition}} = m/m^* \qquad (4.13)$$

and p̄ is the grand average of the performance of each method. The adjusted
censoring rates of di for all numerical simulation experiments are 30 ± 10%.
Table 4.1: Performance of the proposed method based on four numerical
simulation experiments with 200 replications.

              #1          #2          #3          #4        Ave
Method      m     p     m     p     m     p     m     p      p̄
RC         4.2  0.84   4.4  0.88   4.1  0.82   3.9  0.78   0.83
Definition   5           5           5           5
The robustness of this method is verified by replicated simulation
experiments leading to similar results. As an outcome of the proposed
methods, to reduce the number of variables in the dataset for further
analysis, it is recommended to eliminate the variables with lower rank. In
addition, to concentrate on a reduced number of variables, a category of the
highest-ranked variables is suggested.
Next, the simulation experiment, results, and analysis for the 3R method are
presented.
4.5 Simulation Experiment, Results and Analysis for 3R Method
In this section, the 3R method is verified by simulation experiment; the
experimental results and analysis are also discussed.
4.5.1 Simulation Design and Experiment
To evaluate the performance and investigate the accuracy of the proposed
methods, simulation experiments are developed. We set n = 500 observations
and p = 20 variables. Three different simulation models are generated by
pseudorandom algorithms as

$$u_i^1 \sim U(-1, 1), \quad u_i^2 \sim N(0, 1), \quad u_i^3 \sim \mathrm{Exp}(1), \qquad i = 1 \ldots p \qquad (4.14)$$
by the uniform, normal, and exponential distributions respectively, where
$U^j$, $j = 1 \ldots 3$, is the explanatory variable matrix representing the original
dataset U for the j'th model. In each simulation, the coefficient components
are set as

$$\beta_{ii} = 10,\ \beta_{jj} = 1,\ \beta_{kk} = 0.1,\ \beta_{ll} = 0, \qquad ii = 1 \ldots 5,\ jj = 6 \ldots 10,\ kk = 11 \ldots 15,\ ll = 16 \ldots 20 \qquad (4.15)$$
The simulated event time vector Y is obtained from (2.17), and the censoring
indicator is generated using a uniform distribution. To improve accuracy in
calculating the time-to-event vector T, an error vector E ∼ N(0, 0.1) is
included in each simulation. To obtain k, an estimate of the desired number
of variables in any selected subset, principal component analysis (PCA) is
used, which gives k = 3.
4.5.2 Results and Analysis
All three simulation models are applied in the 3R method. A sample of the
numerical results of the simulation experiment, when data are generated by
all three models in (4.14) and applied in the 3R method, is depicted in Fig.
4.2 and Fig. 4.3. A variable with a larger score is more efficient, and a
variable with a smaller score is a candidate for elimination from the dataset
if desired. Each number represents the specific identification number of a
variable.
A set of numerical simulation experiments is conducted to evaluate the
performance of the proposed methods, and the results are presented in Table
4.2. The three simulation models defined in (4.14) are examined. The setting
of these models is as explained above: n = 500 observations and p = 20
variables, following the same data-generating algorithm. The number of
significantly efficient variables in each simulation model is set to m∗ = 5
for each experiment.

Figure 4.2: Normalized coefficient score results in 100 trials for the
simulation dataset in the Normal model. [Plot of normalized coefficient
score versus variable number (1 to 20).]
Each simulation is repeated 200 times independently. For each method, m
represents the average number of determined and selected efficient variables
and p denotes the performance of the method over the 200 trials, where

$$p = m_{\mathrm{method}}/m_{\mathrm{definition}} = m/m^* \qquad (4.16)$$
Figure 4.3: Normalized coefficient score results in 100 trials for the
simulation dataset in the Uniform model (top), Normal model (middle), and
Exponential model (bottom). [Three stacked plots of normalized coefficient
score versus variable number (1 to 20).]
and p̄ is the grand average of the performance of each method. The adjusted
censoring rates for all numerical simulation experiments are 30 ± 10%.

Table 4.2: Performance of the proposed methods based on three numerical
simulation experiments with 200 replications.

              #1          #2          #3        Ave
Method      m     p     m     p     m     p      p̄
RC         4.3  0.86   4.4  0.88   4.2  0.84   0.86
3R         4.0  0.80   4.2  0.84   3.9  0.78   0.81
Definition   5           5           5

The robustness of these methods is verified by replicated simulation
experiments leading to similar results. As an outcome of the proposed
methods, to reduce the number of variables in the dataset for further
analysis, it is recommended to eliminate the variables with lower rank. In
addition, to concentrate on a reduced number of variables, a category of the
highest-ranked variables is suggested.
Chapter 5

Methodology III: Hybrid Clustering and Classification Methods Using
Normalized Weighted K-Means
5.1 Overview
Continuing the analysis of variable selection methods in complex high-
dimensional time-to-event data, a new approach is to cluster variables into
subsets by similarity, in order to reduce the number of variables by (a)
substituting a representative for each subset, and (b) determining the most
efficient or inefficient subsets, which benefits the variable reduction
process. An efficient variable is defined in the previous methodologies [50].
Toward that goal, two hybrid methods are proposed under the assumption that
the dataset contains uncensored data with no missing entries.
The matrix A = [aij] has i rows and j columns, while the column vector a = [aj]
has i elements. Here aij denotes the element of matrix A in row i and column
j, and the j'th column of matrix A is denoted by aj. Additionally, a large
matrix can be composed from smaller ones, such as A = [B C].
The original dataset contains a time-to-event vector with n components,
t = [tn], and may include any type of explanatory variable in the matrix
U = [unp], which comprises n independent observations and p explanatory
variables.
5.2 Methods and Algorithms for Variable Clustering
Two methods are presented in this section.
5.2.1 Clustering through Cost Function
The outline of this method is to obtain the best possible clustering for vari-
ables through a hybrid heuristic algorithm by using k-mean method. In this
method the cumulative results from reasonable number of trials gives an opti-
mum solution which clusters variables considering the eciency on the event
time.
An original given dataset D, as an input to the algorithms, is shown in Table
5.1.
Table 5.1: High-dimensional time-to-event dataset

Obs. #   t     u1    u2    u3    u4    u5    u6    . . .  up
1        t1    u11   u12   u13   u14   u15   u16   . . .  u1p
2        t2    u21   u22   u23   u24   u25   u26   . . .  u2p
. . .    . . .
n        tn    un1   un2   un3   un4   un5   un6   . . .  unp
The dataset matrix D is defined below, where t = [tn] is the time-to-event
vector (time vector) and U = [uij] is the variable matrix:

$$D = [t\ U] \qquad (5.1)$$

The variable matrix U, with p vectors of n components each, is:

$$U = [u_1\ u_2 \ldots u_p] \qquad (5.2)$$
The procedure for each trial is defined as follows:

Step 1: The variable matrix U is normalized in the first step. The minimum
and maximum values of each vector $u_j$ for $j = 1 \ldots p$ are obtained as:

$$u_j^{\min} = \min_i u_{ij}, \qquad u_j^{\max} = \max_i u_{ij}, \qquad i = 1 \ldots n \qquad (5.3)$$

Then the normalized elements $v_{ij}$ for all $u_{ij}$ are calculated:

$$v_{ij} = \frac{u_{ij} - u_j^{\min}}{u_j^{\max} - u_j^{\min}}, \qquad i = 1 \ldots n,\ j = 1 \ldots p \qquad (5.4)$$
The result of the normalization is an n-by-p matrix V = [vnp] called the
normalized variable matrix. The normalized dataset is given by:

$$D_N = [t\ V] = [t\ v_1\ v_2 \ldots v_p] \qquad (5.5)$$
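The column-wise min-max normalization of equations (5.3)-(5.4) is a one-liner with vectorized operations. This sketch is illustrative; the function name `normalize_columns` is an assumption.

```python
import numpy as np

def normalize_columns(U):
    """Min-max normalization of equations (5.3)-(5.4): each column of U
    is rescaled to [0, 1] using its own minimum and maximum."""
    u_min = U.min(axis=0)
    u_max = U.max(axis=0)
    return (U - u_min) / (u_max - u_min)

V = normalize_columns(np.array([[1.0, 10.0],
                                [3.0, 20.0],
                                [5.0, 40.0]]))
print(V)   # each column now spans [0, 1]
```

Note that a constant column would make the denominator zero; the method's assumption is that every variable varies across observations.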
Step 2: In the normalized matrix V, the vectors v1 to vp are clustered
randomly into q clusters containing k variable vectors each, where
k = ⌈p/q⌉. For instance, a set of 12 variables is clustered into 4 clusters
of 3 variable vectors. Each cluster contains k variable vectors randomly
assigned from the columns of matrix V. Fig. 5.1 depicts a sample of the
random clustering in this step.
Therefore, in general, each cluster is represented by (5.6), where k is the
number of vectors per cluster.
Figure 5.1: Random clustering process applied to matrix V. [Columns
v1 . . . vp of the normalized dataset are grouped into clusters W1, W2, . . . .]

$$W_j = [v_j^1\ v_j^2 \ldots v_j^k], \qquad j = 1 \ldots q \qquad (5.6)$$
Superscripts on the vectors indicate the members of each cluster, and the
subscript denotes the cluster indicator. For example, $v_3^2$ is the second
vector in the third cluster, and matrix W4 is the fourth cluster out of the q
clusters. Consequently, the clustered dataset is presented in a new format:

$$D_C = [t\ W_1\ W_2 \ldots W_q] \qquad (5.7)$$
Step 3: By defining a transformation function f, each cluster matrix $W_j$ is
substituted by a variable vector $w_j$, where the i'th element of this vector
results from applying the transformation function to the elements of the
i'th row of the subject matrix $W_j$, as below for $i = 1 \ldots n$:

$$w_{ij} = f(v_{ij}^1, v_{ij}^2, \ldots, v_{ij}^k), \qquad j = 1 \ldots q \qquad (5.8)$$
This transformation is shown in Fig. 5.2.
Figure 5.2: Sample transformation. [The three columns v1, v2, v3 of cluster
W1 are collapsed into the single column w1.]
In the first level of calculation, the transformation function is defined as
the arithmetic mean, where for $i = 1 \ldots n$:

$$w_{ij} = (v_{ij}^1 + v_{ij}^2 + \cdots + v_{ij}^k)/k, \qquad j = 1 \ldots q \qquad (5.9)$$

For example, the first element of the second transformed vector when k = 3 is

$$w_{12} = (v_{12}^1 + v_{12}^2 + v_{12}^3)/3 \qquad (5.10)$$
This process yields the transformed dataset

$$D_R = [t\ W] = [t\ w_1\ w_2 \ldots w_q] \qquad (5.11)$$

a matrix constructed from q + 1 vectors.
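Step 3 with the arithmetic-mean transformation of equation (5.9) reduces to a row-wise mean per cluster. A minimal sketch (the helper name `transform_clusters` is an assumption):

```python
import numpy as np

def transform_clusters(V, clusters):
    """Step 3 sketch: replace each cluster of columns of V by their
    row-wise arithmetic mean (equation 5.9), yielding one vector w_j
    per cluster."""
    return np.column_stack([V[:, idx].mean(axis=1) for idx in clusters])

V = np.arange(12.0).reshape(4, 3)          # n = 4 observations, p = 3
W = transform_clusters(V, [[0, 1, 2]])     # a single cluster of k = 3
print(W.ravel())                           # row-wise means of V
```

Any other choice of transformation function f (median, weighted mean, etc.) slots into the same structure by replacing `.mean(axis=1)`.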
Step 4: In this step, the k-means clustering algorithm is applied with the
parameter m selected as the desired number of clusters. This process is performed
to cluster the observations, considering the time vector t, into m clusters.
As a result, the transformed dataset $D_R$ is split into m clusters:

$$C_i = [t_i\ w_i^1\ w_i^2 \ldots w_i^q], \qquad i = 1 \ldots m \qquad (5.12)$$

where $t_i$ is the i'th cluster's time sub-vector and $w_i^j$ is the subset of
vector $w_j$ allocated to the i'th cluster. Fig. 5.3 shows the concept of
this step.
Figure 5.3: Sample of time-based clustering. [Rows of the transformed
dataset (t, w1, . . . , wq) are grouped into observation clusters C1, C2, . . . .]
As an example from Fig. 5.3, $C_1 = [t_1\ w_1^1\ w_1^2 \ldots w_1^q]$ (the
shaded cluster), where $t_1 = [t_1\ t_2]^T$, $w_1^1 = [w_{11}\ w_{21}]^T$, and
so on.
Step 5: A potential function y is defined over the time data points with
respect to each of the m clusters. The objective of this function is to
calculate the weight of time in each cluster. For the i'th cluster ($C_i$)
it is defined as below, where $t_i^j$ indicates the j'th time element in the
i'th cluster and j runs over the number of time elements in the cluster:

$$y_i = y(t_i^1, t_i^2, \ldots, t_i^j) \qquad (5.13)$$
In the first level, this function is defined as the arithmetic mean of the
time data points in each cluster, $y_i$.
Step 6: A cost function over a trial is defined to aggregate the potential
function outputs $y_i$ of all clusters in the trial, as a criterion for
comparing trials. The objective of this definition is to capture the
variation of the potential function outputs with respect to time-to-event.
The cost function is

$$\mathrm{Cost} = \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} |y_i - y_j| \qquad (5.14)$$
For each trial, Steps 1 to 6 give a cost result. Comparing all costs, the
greatest value identifies the best k-means clustering, which is a function of
the randomly clustered variables. Finally, the variable clustering of that
trial is selected as the optimum clustering.
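The cost of equation (5.14), the sum of pairwise absolute differences of the cluster potentials, can be computed directly. A minimal sketch (the function name `cost` mirrors the equation and is otherwise an assumption):

```python
def cost(y):
    """Cost function of equation (5.14): sum of pairwise absolute
    differences of the cluster potentials y_1 ... y_m."""
    m = len(y)
    return sum(abs(y[i] - y[j])
               for i in range(m - 1)
               for j in range(i + 1, m))

print(cost([1.0, 2.0, 4.0]))   # |1-2| + |1-4| + |2-4| = 6.0
```

A larger cost means the cluster-mean event times are more spread out, which is why the trial with the greatest cost is kept as the optimum clustering.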
5.2.2 Clustering through Weight Function
The concept and objective of this method are similar to those of Method A.
The procedure of Method B is as follows:
Step 1: The original dataset D includes the time-to-event vector t = [tn] and
the variable matrix U = [unp] (equation 5.1). The time vector t and variable
matrix U are normalized as in Step 1 of Method A. The result is a normalized
time vector $t^* = [t_n^*]$ and a normalized variable matrix V = [vnp]. The
new dataset is:

$$D^* = [t^*\ V] = [t^*\ v_1\ v_2 \ldots v_p] \qquad (5.15)$$
Step 2: The normalized dataset $D^*$ is split into m bins, based on the
vector t, using one of the following techniques:
1. K-M Estimator: based on survival time
2. Equal Bin: based on equal time intervals
3. Equal Frequency: based on frequency
Fig. 5.4 presents the binning schema.
Figure 5.4: Schema of time-based binning. [Rows of the normalized dataset
(t∗, v1, . . . , vp) are grouped into bins B1, B2, . . . .]
Therefore,

$$B_i = [t_i^*\ V_i] = [t_i^*\ v_i^1\ v_i^2 \ldots v_i^p], \qquad i = 1 \ldots m \qquad (5.16)$$

where $t_i^*$ is the time sub-vector in the i'th bin and $v_i^j$ is the
sub-vector of the variable vector $v_j$ allocated to the i'th bin. $V_i$ is
the variable matrix containing the variable elements in the i'th bin ($B_i$),
a sub-matrix of V. For instance, in the shaded bin in Fig. 5.4,
$t_2^* = [t_3^*, t_4^*]^T$ and $v_2^1 = [v_{31}, v_{41}]^T$.
As an example, assume a sample dataset of 9 observations (n = 9) with
time-to-event vector t = [10, 20, 30, 30, 50, 60, 70, 70, 90]^T and calculated
K-M survival function s = [1.00, 0.95, 0.80, 0.65, 0.55, 0.30, 0.25, 0.15,
0.00]^T, with respect to the observations. The results of binning into 3 bins
(m = 3) based on each of the above techniques are presented in Table 5.2:
Table 5.2: Sample of time-based binning based on three techniques.

         K-M Estimator   Equal Bin      Equal Frequency
Bin #1   1, 2, 3         1, 2, 3, 4     1, 2, 3
Bin #2   4, 5            5, 6           4, 5, 6
Bin #3   6, 7, 8, 9      7, 8, 9        7, 8, 9
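The Equal Bin and Equal Frequency columns of Table 5.2 can be reproduced with a few lines. This is an illustrative sketch under stated assumptions (ties placed in the lower interval, n divisible by m for equal frequency); the K-M-based binning additionally needs the survival function and is omitted here.

```python
import numpy as np

t = np.array([10, 20, 30, 30, 50, 60, 70, 70, 90])
m = 3

# Equal Bin: m equal time intervals over [t.min(), t.max()].
edges = np.linspace(t.min(), t.max(), m + 1)
equal_bin = np.clip(np.searchsorted(edges, t, side="right") - 1, 0, m - 1)

# Equal Frequency: the same number of observations per bin
# (assumes n is divisible by m and t is sorted).
equal_freq = np.repeat(np.arange(m), len(t) // m)

print(equal_bin + 1)    # observation -> bin number: [1 1 1 1 2 2 3 3 3]
print(equal_freq + 1)   #                            [1 1 1 2 2 2 3 3 3]
```

These match the Equal Bin and Equal Frequency columns of Table 5.2 for the sample dataset.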
Step 3: To calculate the weight of time in each cluster, a weight function is
defined over the time data points with respect to each of the m clusters. In
the first level, this function is defined as the arithmetic mean of the time
data points in the cluster, $z_i$. In Fig. 5.4, the weight function of the
shaded bin #2 ($B_2$), containing $t_3^*$ and $t_4^*$, is calculated as

$$z_2 = (t_3^* + t_4^*)/2 \qquad (5.17)$$
Step 4: Multiplying each weight function scalar $z_i$ by the corresponding
bin variable matrix $V_i$ generates the weighted variable matrix $V^*$:

$$V_i^* = z_i V_i, \qquad i = 1 \ldots m \qquad (5.18)$$

$$V^{*T} = [V_1^{*T}\ V_2^{*T} \ldots V_m^{*T}] \qquad (5.19)$$
Step 5: The k-means clustering algorithm is applied to the weighted variable
matrix $V^*$ over the variables, clustering the variable vectors into subsets.
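Steps 3-4 of Method B can be sketched as follows. This is an illustration, not the dissertation's MATLAB code: `weighted_variable_matrix` is a hypothetical helper, and `bins` is assumed to map each observation to its bin index from Step 2.

```python
import numpy as np

def weighted_variable_matrix(t_norm, V, bins):
    """Steps 3-4 of Method B: weight each bin's rows of V by the mean
    normalized time of the bin (z_i of eq. 5.17), then stack the weighted
    blocks back together (eqs. 5.18-5.19)."""
    V_star = V.astype(float).copy()
    for b in np.unique(bins):
        rows = bins == b
        z = t_norm[rows].mean()        # weight function z_i
        V_star[rows] *= z              # V_i* = z_i V_i
    return V_star

t_norm = np.array([0.0, 0.5, 1.0, 1.0])
V = np.ones((4, 2))
bins = np.array([0, 0, 1, 1])
print(weighted_variable_matrix(t_norm, V, bins))
```

Step 5 would then feed the columns of the returned matrix to any k-means implementation to cluster the variable vectors; that step is omitted here.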
5.3 Experiment and Result
Implementing Methods A and B in MATLAB shows robustness under tuning of the
parameter definitions, such as k and m. A sample result for Method A, applied
to the PBC dataset with 16 selected variables and 111 uncensored
observations, is given in Table 5.3 for k = 4 and m = 5, over 100 trials:
Table 5.3: Clustering by cost function

Cluster #   Variable #
1           1, 2, 13, 15
2           3, 5, 7, 10
3           8, 9, 12, 14
4           4, 6, 11, 16
The result shows the four variable clusters obtained by cost-function
comparison.
Chapter 6
Conclusion
Chapter 6 presents a summary of the research and an overview of further
research steps.
6.1 Summary
Proper management and utilization of valuable data can significantly increase
knowledge and reduce cost through preventive actions, whereas erroneous and
misinterpreted data can lead to poor inference and decision-making. Variable
selection is necessary in decision-making applications, as unimportant
variables add unnecessary noise, and it is often not clear which variables
would be most useful and cost-effective to collect. Therefore, data
collection and the analysis of data play crucial roles in data-driven
decision-making. In
many cases, data are collected from different sources, such as financial
reports and marketing data, and combined for more informative decision-
making. Consequently, determining effective explanatory variables,
specifically in complex and high-dimensional data, provides an excellent
opportunity to increase efficiency and reduce costs. Unlike traditional
datasets with few explanatory variables, the analysis of datasets with a
large number of variables requires different approaches.
The proposed methods benefit explanatory variable selection in the decision-
making process by obtaining an appropriate subset of variables from high-
dimensional, large-scale time-to-event data. By using such methods and
reducing the number of variables, data analysis and decision-making processes
become faster, simpler, and more accurate. For example, in business
applications, many explanatory variables in a customer survey are defined
based on a cause-and-effect analysis process or a similar analytic process
outcome. In most cases, the correlations among these variables are
complicated and unknown, and it is important to understand simply the
efficiency of each variable in the survey. These procedures are potentially
applicable to many problems across a vast area of science and technology.
6.2 Next Steps
The next steps in this research are to examine the algorithms by exhaustive
methods for robustness and to use evolutionary algorithms to optimize the
solution.
By definition, heuristic algorithms find approximate solutions within
acceptable time and computational space. These algorithms either give a
nearly correct answer or provide a solution for only some instances of the
problem, and they are usually used for problems that cannot be solved easily.
An efficient class of heuristic algorithms is evolutionary algorithms, which
exploit ideas from biological evolution, such as reproduction, mutation, and
recombination, to search for the solution of an optimization problem. They
apply the principle of survival to a set of potential solutions to produce
gradual approximations to the optimum. Evolutionary techniques can differ in
implementation details and in the problems to which they are applied.
Another challenge in this research is handling data-streaming circumstances,
in which variables are time-dependent and real-time analysis of the data is
more complicated.
Bibliography
[1] Hervé Abdi and Lynne J Williams. Principal component analysis. Wiley
Interdisciplinary Reviews: Computational Statistics, 2(4):433459, 2010.
[2] D Addison, Stefan Wermter, and G Arevian. A comparison of feature
extraction and selection techniques. In Proceedings of International
Conference on Artificial Neural Networks (Supplementary Proceedings),
pages 212–215, 2003.
[3] H Akaike. An information criterion (aic). Math Sci, 14(153):59, 1976.
[4] D.H. Besterfield. Total Quality Management. Prentice Hall, 1995.
[5] Leo Breiman, Jerome Friedman, Charles J Stone, and Richard A Olshen.
Classification and Regression Trees. CRC Press, 1984.
[6] Brad Brown, Michael Chui, and James Manyika. Are you ready for the
era of big data. McKinsey Quarterly, 4:2435, 2011.
[7] E Brynjolfsson. A revolution in decision-making improves productivity,
2012.
[8] Jonathan Buckley and Ian James. Linear regression with censored data.
Biometrika, 66(3):429436, 1979.
[9] Jerome R Busemeyer and James T Townsend. Decision field theory: A
dynamic-cognitive approach to decision making in an uncertain environment.
Psychological Review, 100(3):432, 1993.
[10] Junyi Chai, James NK Liu, and Eric WT Ngai. Application of decision-
making techniques in supplier selection: A systematic review of litera-
ture. Expert Systems with Applications, 40(10):38723885, 2013.
[11] David R Cox. Regression models and life-tables. Journal of the Royal
Statistical Society. Series B (Methodological), pages 187220, 1972.
[12] Nasser Fard and Keivan Sadeghzadeh. Heuristic ranking classification
method for complex large-scale survival data. In Modelling, Computation
and Optimization in Information Systems and Management Sciences,
pages 47–56. Springer, 2015.
[13] Dan Feldman, Melanie Schmidt, and Christian Sohler. Turning big data
into tiny data: Constant-size coresets for k-means, pca and projective
clustering. In Proceedings of the Twenty-Fourth Annual ACM-SIAM
Symposium on Discrete Algorithms, pages 14341453. SIAM, 2013.
[14] Thomas A Feo and Mauricio GC Resende. Greedy randomized adaptive
search procedures. Journal of global optimization, 6(2):109133, 1995.
[15] Gartner. 2011.
[16] David A Garvin. Competing on the eight dimensions of quality. IEEE
Engineering Management Review, 24(1):1523, 1996.
[17] Michel Gendreau and Jean-Yves Potvin. Tabu search. In Search Method-
ologies, pages 165186. Springer, 2005.
[18] Jan J Gerbrands. On the relationships between svd, klt and pca. Pattern
Recognition, 14(1):375381, 1981.
[19] Isabelle Guyon and André Elisseeff. An introduction to variable and
feature selection. The Journal of Machine Learning Research, 3:1157–1182,
2003.
[20] J Pirie Hart and Andrew W Shogan. Semi-greedy heuristics: An empir-
ical study. Operations Research Letters, 6(3):107114, 1987.
[21] Joe Hellerstein. Parallel programming in the age of big data. Gigaom
Blog, 2008.
[22] Martin Hilbert and Priscila López. The world's technological capacity to
store, communicate, and compute information. science, 332(6025):60
65, 2011.
[23] Theodore R Holford et al. Multivariate Methods in Epidemiology. Oxford
University Press, 2002.
[24] Jian Huang, Shuangge Ma, and Huiliang Xie. Regularized estimation
in the accelerated failure time model with high-dimensional covariates.
Biometrics, 62(3):813820, 2006.
[25] IBM. Information integration and governance. 2011.
[26] IBM. What is big data? bringing big data to the enterprise. 2013.
[27] Hemant Ishwaran, Udaya B Kogalur, Eugene H Blackstone, and
Michael S Lauer. Random survival forests. The Annals of Applied Statis-
tics, pages 841860, 2008.
[28] I. Ishwaran and U.B. Kogalur. Random survival forests for R. 2007.
[29] Zhezhen Jin, DY Lin, and Zhiliang Ying. On least-squares regression
with censored data. Biometrika, 93(1):147161, 2006.
[30] Ian Jolliffe. Principal Component Analysis. Wiley Online Library, 2002.
[31] John D Kalbfleisch and Ross L Prentice. The Statistical Analysis of
Failure Time Data, volume 360. John Wiley & Sons, 2011.
[32] Louis-Philippe Kronek and Anupama Reddy. Logical analysis of survival
data: prognostic survival models by detecting high-degree interactions
in right-censored data. Bioinformatics, 24(16):i248i253, 2008.
[33] Elisa T Lee and John Wang. Statistical Methods for Survival Data Anal-
ysis, volume 476. John Wiley & Sons, 2003.
[34] Huan Liu, Farhad Hussain, Chew Lim Tan, and Manoranjan Dash. Dis-
cretization: An enabling technique. Data Mining and Knowledge Dis-
covery, 6(4):393423, 2002.
[35] Steve Lohr. The age of big data. New York Times, 11, 2012.
[36] Shuangge Ma, Michael R Kosorok, and Jason P Fine. Additive risk
models for survival data with high-dimensional covariates. Biometrics,
62(1):202210, 2006.
[37] James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard
Dobbs, Charles Roxburgh, and Angela H Byers. Big data: The next
frontier for innovation, competition, and productivity. 2011.
[38] C. Matteo and G. Francesca. Variable selection methods: An
introduction.
[39] Andrew McAfee, Erik Brynjolfsson, Thomas H Davenport, DJ Patil,
and Dominic Barton. Big data. The management revolution. Harvard
Business Review, 90(10):6167, 2012.
[40] Alexandre C Mendes and Nasser Fard. Accelerated failure time models
comparison to the proportional hazard model for time-dependent covari-
ates with recurring events. International Journal of Reliability, Quality
and Safety Engineering, 21(02):1450010, 2014.
[41] J Moran. Is big data a big problem for manufacturers, 2013.
[42] Hiroshi Motoda and Huan Liu. Feature selection, extraction and con-
struction. Communication of IICM (Institute of Information and Com-
puting Machinery, Taiwan), 5:6772, 2002.
[43] M. Nemscho. How big data can improve manufacturing quality, 2014.
[44] Arthur V Peterson Jr. Expressing the kaplan-meier estimator as a func-
tion of empirical subsurvival functions. Journal of the American Statis-
tical Association, 72(360a):854858, 1977.
[45] Philip Russom et al. Big data analytics. TDWI Best Practices Report,
Fourth Quarter, 2011.
[46] Keivan Sadeghzadeh and Nasser Fard. Multidisciplinary decision-
making approach to high-dimensional event history analysis through
variable reduction. European Journal of Economics and Management,
1(2):7689, 2014.
[47] Keivan Sadeghzadeh and Nasser Fard. Nonparametric data reduction
approach for large-scale survival data analysis. In Reliability and Main-
tainability Symposium (RAMS), 2015 Annual, pages 16. IEEE, 2015.
[48] Keivan Sadeghzadeh and Nasser Fard. Variable selection methods
for right-censored time-to-event data with high-dimensional covariates.
Journal of Quality and Reliability Engineering, 2015, 2015.
[49] Keivan Sadeghzadeh and Nasser Fard. Complex data classification in
weighted accelerated failure time model. IEEE, In Press.
[50] Keivan Sadeghzadeh, Nasser Fard, and Mohammad Bagher Salehi. An-
alytical clustering procedures in massive failure data. Applied Stochastic
Models in Business and Industry, In Press.
[51] Keivan Sadeghzadeh and Mohammad Bagher Salehi. Mathematical
analysis of fuel cell strategic technologies development solutions in
the automotive industry by the topsis multi-criteria decision making
method. International Journal of Hydrogen Energy, 36(20):1327213280,
2011.
[52] Yvan Saeys, Iñaki Inza, and Pedro Larrañaga. A review of feature
selection techniques in bioinformatics. bioinformatics. Bioinformatics,
23(19):25072517, 2007.
[53] Gideon Schwarz et al. Estimating the dimension of a model. The Annals
of Statistics, 6(2):461464, 1978.
120
[54] Toby Segaran and Je Hammerbacher. Beautiful Data: The Stories
Behind Elegant Data Solutions. " O'Reilly Media, Inc.", 2009.
[55] Winfried Stute and J-L Wang. The strong law under random censorship.
The Annals of Statistics, pages 15911607, 1993.
[56] KP Suresh. An overview of randomization techniques: An unbiased as-
sessment of outcome in clinical research. Journal of Human Reproductive
Sciences, 4(1):8, 2011.
[57] P. Tan. Introduction to Data Mining. WP Co, 2006.
[58] A.R. Tenner and I.J. Detoro. Total Quality Management: Three Steps
to Continuous Improvement. Addison-Wesley.
[59] Hosmer D. W., Jr. and S. Lemeshow. Applied Survival Analysis: Re-
gression Modeling of Time to Event Data. Wiley, 1999.
[60] Fang Yao. Functional principal component analysis for longitudinal and
survival data. Statistica Sinica, 17(3):965, 2007.
[61] J. Ye. Unsupervised and Supervised Dimension Reduction: Algorithms
and Connections. Arizona State University.
[62] Donglin Zeng and DY Lin. Ecient estimation for the accelerated
failure time model. Journal of the American Statistical Association,
102(480):13871396, 2007.
[63] William R Zwick and Wayne F Velicer. Comparison of ve rules for
determining the number of components to retain. Psychological Bulletin,
99(3):432, 1986.
121
Appendix A
MATLAB Code
clc;
clear;
close all;
commandwindow;
%% ========================================================================
%% part setting
% =========================================================================
%%
% part 01: sda (survival data analysis)
% part 02: svd (singular value decomposition)
% part 03: pca (principal component analysis)
% part 04: cox phm (cox proportional hazards model)
% part 05: mlr (multiple linear regression model)
% part 06: k-means (clustering)
% part 07: rf (random forest)
% part 08: on and off comparison (doe approach)
% part 09: test score - brute-force (exhaustive) search (sda combination)
% part 10: test score - random subset selection (sda randomization)
% part 11: time score - brute-force (exhaustive) search (sda combination)
% part 12: time score - random subset selection (sda randomization)
% part 13: bisection semi-greedy clustering algorithm (bisection method)
% part 14: splitting semi-greedy clustering algorithm (splitting method)
% part 15: weighted standardized score method
% part 16: aft
% part 17: penalty - cost function
% part 18: hybrid algorithm
% part 19: variable clustering and correlation
% part 20: plots and charts
% part 21: fuzzy time score - random subset selection
% part 22: fuzzy weighted standardized score method
% part 23: MIII1
% part 24: MIII2
% part no.:  [01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24]
part_code  = [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0];
%% ========================================================================
%% data setting
% =========================================================================
%%
% data 01: data generator 01
% data 02: data generator 02
% data 03: data generator 03
% data 04: pbc data
% data 05: pharynx data
% data 06: actg data
% data 07: new data
% data 08: gbcs data
% data 09: va data
% data 10: sample data
% data 11: simulation data generator 01
% data 12: simulation data generator 02
% data 13: simulation data generator 03
% data 14: sample data
% data no.:  [01 02 03 04 05 06 07 08 09 10 11 12 13 14]
data_code  = [ 0  0  0  1  0  0  0  0  0  0  0  0  0  0];

% http://vincentarelbundock.github.io/Rdatasets/datasets.html
%% ========================================================================
%% censoring
% =========================================================================
%%
% data censoring 01: uncensored
% data censoring 02: censored
% data cens no.:  [01 02]
data_cens_code  = [ 1  1];
%% ========================================================================
%% explanatory variable data type
% =========================================================================
%%
% data type 01: logical
% data type 02: normal
% data type no.:  [01 02]
data_type_code  = [ 1  1];
%% ========================================================================
%% calculation method
% =========================================================================
%%
% method 01: mid-range
% method 02: mean
% method 03: median
% method 04:
% method no.:  [01 02 03 04]
method_code  = [ 1  0  0  0];
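The calculation methods above pick the per-variable central value that later serves as the dichotomization threshold in the general-setting section. As an illustrative aside (pure Python, function names are mine and not part of the MATLAB listing), the three statistics are:

```python
def mid_range(xs):
    """Mid-range: halfway between the minimum and maximum."""
    return (min(xs) + max(xs)) / 2.0

def mean(xs):
    """Arithmetic mean."""
    return sum(xs) / len(xs)

def median(xs):
    """Median: middle value, averaging the two middle values for even length."""
    s = sorted(xs)
    m = len(s) // 2
    return s[m] if len(s) % 2 else (s[m - 1] + s[m]) / 2.0
```

The mid-range is the default here (method 01); it is cheap but sensitive to outliers, which is why the mean and median are offered as alternatives.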
%% ========================================================================
%% ---------- data 01: data generator 01 ----------------------------------
% =========================================================================
%%
if data_code(01) == 1

    data = [];

    data_generated = rand(100,10);

    load 'pbc'; % pbc must be loaded before it is used below

    x_km_all = rand(418,17);
    data = [data,pbc(:,1:3),x_km_all];

end % if data_code(01) == 1
%% ========================================================================
%% ---------- data 02: data generator 02 ----------------------------------
% =========================================================================
%%
if data_code(02) == 1

    data = [];
    n = 400;
    lower = 1;
    upper = 5;

    data(1:n,1) = 1:n;
    % data(1:n,2) = sortrows(fix(1000*rand(n,1)));
    data(1:100,2) = sortrows(fix(100*rand(100,1)));
    data(101:200,2) = sortrows(fix(100*rand(100,1)+200));
    data(201:300,2) = sortrows(fix(100*rand(100,1)+400));
    data(301:400,2) = sortrows(fix(100*rand(100,1)+600));

    data(1:n,3) = 1;

    data(1:n,4) = 0; % data(n,4) = 0.1;
    data(1:n,5) = 1; % data(n,4) = 0.9;
    data(1:n,6) = 2*data(1:n,1) + 2*rand(n,1); % a(ID) + e
    data(1:n,7) = -3*data(1:n,1) + -3*rand(n,1); % a(ID)^2 + e
    data(1:n,8) = 4*data(1:n,2) + 4*rand(n,1); % a(T) + e
    data(1:n,9) = -5*data(1:n,2) + -5*rand(n,1); % a(T)^2 + e
    data(1:n,10) = sign(mod((data(1:n,1)-1 - mod(data(1:n,1)-1,5)),10)); % 0 and 1 in order for every 5 cons. obs.
    data(1:n,11) = randn(n,1); % normal random N(0,1)
    data(1:n,12) = 1 + 2.*randn(n,1); % normal random N(1,2)
    data(1:n,13) = normrnd(0,1,n,1); % normal random N(0,1)
    data(1:n,14) = randi([lower upper],n,1); % integer uniform random (lower to upper)
    data(1:n,15) = lower + (upper-lower).*rand(n,1); % uniform random (lower to upper)
    data(1:n,16) = rand(n,1); % uniform random (0 to 1)
    data(1:n,17) = rand(n,1) <= 0.1; % 0 for 90% and 1 for 10%
    data(1:n,18) = rand(n,1) <= 0.5; % 0 for 50% and 1 for 50%
    data(1:n,19) = rand(n,1) <= 0.9; % 0 for 10% and 1 for 90%
    data(1:n,20) = random('unif',1,10,n,1); % uniform random U(1,10)
    data(1:n,21) = random('norm',2,3,n,1); % normal random N(2,3)
    data(1:n,22) = random('exp',2,n,1); % exponential random E(2)
    data(1:n,23) = random('wbl',2,3,n,1); % weibull random W(2,3)

    % http://www.mathworks.com/help/stats/random.html

end % if data_code(02) == 1
%% ========================================================================
%% ---------- data 03: data generator 03 ----------------------------------
% =========================================================================
%%
if data_code(03) == 1

    data = [];
    n = 300;
    lower = 1;
    upper = 5;

    data(1:n,1) = 1:n;
    % data(1:n,2) = sortrows(fix(1000*rand(n,1)));
    data(1:100,2) = sortrows(fix(100*rand(100,1)));
    data(101:200,2) = sortrows(fix(100*rand(100,1)+200));
    data(201:300,2) = sortrows(fix(100*rand(100,1)+400));

    data(1:n,3) = 1;

    data(1:n,4) = 0; % data(n,4) = 0.1;
    data(1:n,5) = 1; % data(n,4) = 0.9;
    data(1:n,6) = 2*data(1:n,1) + 2*rand(n,1); % a(ID) + e
    data(1:n,7) = -3*data(1:n,1) + -3*rand(n,1); % a(ID)^2 + e
    data(1:n,8) = 4*data(1:n,2) + 4*rand(n,1); % a(T) + e
    data(1:n,9) = -5*data(1:n,2) + -5*rand(n,1); % a(T)^2 + e
    data(1:n,10) = sign(mod((data(1:n,1)-1 - mod(data(1:n,1)-1,5)),10)); % 0 and 1 in order for every 5 cons. obs.
    data(1:n,11) = randn(n,1); % normal random N(0,1)
    data(1:n,12) = 1 + 2.*randn(n,1); % normal random N(1,2)
    data(1:n,13) = normrnd(0,1,n,1); % normal random N(0,1)
    data(1:n,14) = randi([lower upper],n,1); % integer uniform random (lower to upper)
    data(1:n,15) = lower + (upper-lower).*rand(n,1); % uniform random (lower to upper)
    data(1:n,16) = rand(n,1); % uniform random (0 to 1)
    data(1:n,17) = rand(n,1) <= 0.1; % 0 for 90% and 1 for 10%
    data(1:n,18) = rand(n,1) <= 0.5; % 0 for 50% and 1 for 50%
    data(1:n,19) = rand(n,1) <= 0.9; % 0 for 10% and 1 for 90%
    data(1:n,20) = random('unif',1,10,n,1); % uniform random U(1,10)
    data(1:n,21) = random('norm',2,3,n,1); % normal random N(2,3)
    data(1:n,22) = random('exp',2,n,1); % exponential random E(2)
    data(1:n,23) = random('wbl',2,3,n,1); % weibull random W(2,3)

    % http://www.mathworks.com/help/stats/random.html

end % if data_code(03) == 1
%% ========================================================================
%% ---------- data 04: pbc data -------------------------------------------
% =========================================================================
%%
if data_code(04) == 1

    load 'pbc';

    data = pbc;

    % N    Case number.
    % X    The number of days between registration and the earlier of death, liver transplantation, or study analysis time in July 1986.
    % D    1 if X is time to death, 0 if time to censoring.

    % Z1   Treatment code, 1 = D-penicillamine, 2 = placebo.
    % Z2   Age in years. For the first 312 cases, age was calculated by dividing the number of days between birth and study registration by 365.
    % Z3   Sex, 0 = male, 1 = female.
    % Z4   Presence of ascites, 0 = no, 1 = yes.
    % Z5   Presence of hepatomegaly, 0 = no, 1 = yes.
    % Z6   Presence of spiders, 0 = no, 1 = yes.
    % Z7   Presence of edema, 0 = no edema and no diuretic therapy for edema; 0.5 = edema present for which no diuretic therapy was given, or edema resolved with diuretic therapy; 1 = edema despite diuretic therapy.
    % Z8   Serum bilirubin, in mg/dl.
    % Z9   Serum cholesterol, in mg/dl.
    % Z10  Albumin, in gm/dl.
    % Z11  Urine copper, in mg/day.
    % Z12  Alkaline phosphatase, in U/liter.
    % Z13  SGOT, in U/ml.
    % Z14  Triglycerides, in mg/dl.
    % Z15  Platelet count; coded value is number of platelets per cubic milliliter of blood divided by 1000.
    % Z16  Prothrombin time, in seconds.
    % Z17  Histologic stage of disease, graded 1, 2, 3, or 4.

end % if data_code(04) == 1
%% ========================================================================
%% ---------- data 05: pharynx data ---------------------------------------
% =========================================================================
%%
if data_code(05) == 1

    load 'pharynx';

    data = pharynx;

    % CASE      Case number
    % TIME      Survival time in days from day of diagnosis
    % STATUS    0 = censored, 1 = dead
    % INST      Participating institution
    % SEX       1 = male, 2 = female
    % TX        Treatment: 1 = standard, 2 = test
    % GRADE     1 = well differentiated, 2 = moderately differentiated, 3 = poorly differentiated, 9 = missing
    % AGE       In years at time of diagnosis
    % COND      Condition: 1 = no disability, 2 = restricted work, 3 = requires assistance with self care, 4 = bed confined, 9 = missing
    % SITE      1 = faucial arch, 2 = tonsillar fossa, 3 = posterior pillar, 4 = pharyngeal tongue, 5 = posterior wall
    % T_STAGE   1 = primary tumor measuring 2 cm or less in largest diameter, 2 = primary tumor measuring 2 cm to 4 cm in largest diameter with minimal infiltration in depth, 3 = primary tumor measuring more than 4 cm, 4 = massive invasive tumor
    % N_STAGE   0 = no clinical evidence of node metastases, 1 = single positive node 3 cm or less in diameter, not fixed, 2 = single positive node more than 3 cm in diameter, not fixed, 3 = multiple positive nodes or fixed positive nodes
    % ENTRY_DT  Date of study entry: day of year and year, dddyy

end % if data_code(05) == 1
%% ========================================================================
%% ---------- data 06: actg data ------------------------------------------
% =========================================================================
%%
if data_code(06) == 1

    load 'actg';

    data = actg;

end % if data_code(06) == 1
%% ========================================================================
%% ---------- data 07: new data -------------------------------------------
% =========================================================================
%%
if data_code(07) == 1

    load 'new';

    data = new;

end % if data_code(07) == 1
%% ========================================================================
%% ---------- data 08: gbcs data ------------------------------------------
% =========================================================================
%%
if data_code(08) == 1

    load 'gbcs';

    data = gbcs;

end % if data_code(08) == 1
%% ========================================================================
%% ---------- data 09: va data --------------------------------------------
% =========================================================================
%%
if data_code(09) == 1

    load 'va';

    data = va;

end % if data_code(09) == 1
%% ========================================================================
%% ---------- data 10: sample data ----------------------------------------
% =========================================================================
%%
if data_code(10) == 1

    load 'sample01';
    load 'sample02';
    load 'sample03';
    load 'sample04';
    load 'sample05';

    data = sample05;

end % if data_code(10) == 1
%% ========================================================================
%% ---------- data 11: simulation data generator 01 -----------------------
% =========================================================================
%%
if data_code(11) == 1

    data = [];
    n = 1000;
    lower = 1;
    upper = 5;

    data(1:n,1) = 1:n;
    % data(1:n,2) = sortrows(fix(1000*rand(n,1)));

    data(1:n,3) = 1;

    data(1:n,4) = normrnd(0,1,n,1); %+ data(1:n,1); % normal random N(0,1)
    data(1:n,5) = normrnd(0,1,n,1); %+ data(1:n,1); % normal random N(0,1)
    data(1:n,6) = normrnd(0,1,n,1); %+ data(1:n,1); % normal random N(0,1)
    data(1:n,7) = normrnd(0,1,n,1); %+ data(1:n,1); % normal random N(0,1)
    data(1:n,8) = normrnd(0,1,n,1); %+ data(1:n,1); % normal random N(0,1)
    data(1:n,9) = normrnd(0,1,n,1); % normal random N(0,1)
    data(1:n,10) = normrnd(0,1,n,1); % normal random N(0,1)
    data(1:n,11) = normrnd(0,1,n,1); % normal random N(0,1)
    data(1:n,12) = normrnd(0,1,n,1); % normal random N(0,1)
    data(1:n,13) = normrnd(0,1,n,1); % normal random N(0,1)
    data(1:n,14) = rand(n,1); % uniform random (0 to 1)
    data(1:n,15) = rand(n,1); % uniform random (0 to 1)
    data(1:n,16) = rand(n,1); % uniform random (0 to 1)
    data(1:n,17) = rand(n,1); % uniform random (0 to 1)
    data(1:n,18) = rand(n,1); % uniform random (0 to 1)
    data(1:n,19) = random('exp',1,n,1); % exponential random E(1)
    data(1:n,20) = random('exp',1,n,1); % exponential random E(1)
    data(1:n,21) = random('exp',1,n,1); % exponential random E(1)
    data(1:n,22) = random('exp',1,n,1); % exponential random E(1)
    data(1:n,23) = random('exp',1,n,1); % exponential random E(1)
    data(1:n,24) = random('wbl',2,1,n,1); % weibull random W(2,1)
    data(1:n,25) = random('wbl',2,1,n,1); % weibull random W(2,1)
    data(1:n,26) = random('wbl',2,1,n,1); % weibull random W(2,1)
    data(1:n,27) = random('wbl',2,1,n,1); % weibull random W(2,1)
    data(1:n,28) = random('wbl',2,1,n,1); % weibull random W(2,1)

    for i = 1:n
        betha1x = sum(data(i,4:8));
        betha2x = sum(data(i,9:13));
        betha3x = sum(data(i,14:18));
        betha4x = sum(data(i,19:28));
        data(i,2) = 10*betha1x + 1*betha2x + 0.1*betha3x + 0*betha4x;
    end

    % http://www.mathworks.com/help/stats/random.html

end % if data_code(11) == 1
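The simulation generator above draws covariates in groups of five and builds the response (column 2) as a weighted sum of the group totals with weights 10, 1, 0.1 and 0, so that the groups carry progressively weaker signal. The same scheme for the normal-covariate case can be sketched in pure Python (names are mine, not part of the MATLAB listing):

```python
import random

def simulate_row(rng, weights=(10.0, 1.0, 0.1, 0.0)):
    """One observation: four groups of five N(0,1) covariates; the
    response is the weighted sum of the group totals, using the
    10 / 1 / 0.1 / 0 weight pattern of the generator above."""
    groups = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(4)]
    response = sum(w * sum(g) for w, g in zip(weights, groups))
    return response, groups
```

With a seeded `random.Random`, repeated calls reproduce the same simulated dataset, which is convenient when comparing variable-selection runs.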
%% ========================================================================
%% ---------- data 12: simulation data generator 02 -----------------------
% =========================================================================
%%
if data_code(12) == 1

    data = [];
    n = 1000;
    lower = 1;
    upper = 5;

    data(1:n,1) = 1:n;
    % data(1:n,2) = sortrows(fix(1000*rand(n,1)));

    data(1:n,3) = 1;

    for i = 4:23
        data(1:n,i) = normrnd(0,1,n,1); % normal random N(0,1)
        % data(1:n,i) = rand(n,1); % uniform random (0 to 1)
        % data(1:n,i) = random('exp',1,n,1); % exponential random E(1)
        % data(1:n,i) = random('wbl',2,1,n,1); % weibull random W(2,1)
    end

    for i = 1:n
        betha1x = sum(data(i,4:8));
        betha2x = sum(data(i,9:13));
        betha3x = sum(data(i,14:18));
        betha4x = sum(data(i,19:23));
        % betha4x = sum(data(i,19:28));
        data(i,2) = 10*betha1x + 1*betha2x + 0.1*betha3x + 0*betha4x;
    end

    % http://www.mathworks.com/help/stats/random.html

end % if data_code(12) == 1
%% ========================================================================
%% ---------- data 13: simulation data generator 03 -----------------------
% =========================================================================
%%
if data_code(13) == 1

end % if data_code(13) == 1
%% ========================================================================
%% ---------- data 14: data generator 02 ----------------------------------
% =========================================================================
%%
if data_code(14) == 1

    data = [];

    data_generated = rand(100,10);

    load 'pbc'; % pbc must be loaded before it is used below

    x_km_all = rand(418,17);
    data = [data,pbc(:,1:3),x_km_all];

end % if data_code(14) == 1
%%
%% ========================================================================
%% general setting
% =========================================================================
%%
data(any(isnan(data),2),:) = []; % remove NaN data [data cleaning]

data_unc = data; % defining uncensored dataset matrix
data_unc(find(data_unc(:,3)==0),:) = []; % remove censored data
data_unc = sortrows(data_unc,2); % sort by failure time

data_cens = data; % defining censored dataset matrix
data_cens = sortrows(data_cens,2); % sort by failure time

if data_cens_code(01) == 1
    [n,p] = size(data_unc);
    data_cal = data_unc; % defining calculation dataset matrix
end

if data_cens_code(02) == 1
    [n,p] = size(data_cens);
    data_cal = data_cens; % defining calculation dataset matrix
end

data_cal(n+1,:) = min(data_cal(1:n,:)); % calculate variable min
data_cal(n+2,:) = max(data_cal(1:n,:)); % calculate variable max

if method_code(01) == 1; data_cal(n+3,:) = (data_cal(n+1,:) + data_cal(n+2,:))/2; end % calculate variable "mid-range"
if method_code(02) == 1; data_cal(n+3,:) = mean(data_cal(1:n,:)); end % calculate variable "mean"
if method_code(03) == 1; data_cal(n+3,:) = median(data_cal(1:n,:)); end % calculate variable "median"

data_raw = data_cal(1:n,4:p); % defining raw dataset matrix
data_nor = zscore(data_raw); % defining normalized dataset matrix

% logical vector

data_log = data_cal; % defining logical dataset matrix
data_fuz = data_cal; % defining fuzzy dataset matrix

for i = 1:n
    for j = 4:p
        if data_log(i,j) < data_log(n+3,j)
            data_log(i,j) = 0;
        else
            data_log(i,j) = 1;
        end
    end
end

for i = 1:n
    for j = 4:p
        if data_fuz(n+1,j) == data_fuz(n+2,j)
            data_fuz(i,j) = data_fuz(n+1,j);
        else
            data_fuz(i,j) = (data_fuz(i,j)-data_fuz(n+1,j))/(data_fuz(n+2,j)-data_fuz(n+1,j));
        end
    end
end

if data_type_code(01) == 1
    data_time = data_log(1:n,2); % defining failure time vector
    data_var = data_log(1:n,4:p); % defining logical variable dataset matrix
end

if data_type_code(02) == 1
    data_time = data_fuz(1:n,2); % defining failure time vector
    % data_var = data_fuz(1:n,4:p); % defining fuzzy variable dataset matrix
    data_fuzzy = data_fuz(1:n,4:p); % defining fuzzy variable dataset matrix
end

% extra change for var 1 (switching): % data_var(:,1) = sign(data_var(:,1)-1);
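The two transformation loops in the general setting implement the logical (threshold) and fuzzy (min-max) covariate transforms. The same operations, column by column, can be sketched in pure Python (function names are mine, not part of the MATLAB listing):

```python
def dichotomize(xs, threshold):
    """Logical transform: 0 below the threshold, 1 otherwise."""
    return [0 if x < threshold else 1 for x in xs]

def min_max_scale(xs):
    """Fuzzy transform: rescale a column to [0, 1]; a constant
    column is left at its constant value, mirroring the MATLAB
    special case where min == max."""
    lo, hi = min(xs), max(xs)
    if lo == hi:
        return list(xs)
    return [(x - lo) / (hi - lo) for x in xs]
```

The threshold passed to `dichotomize` is the mid-range, mean, or median chosen by `method_code`.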
var(p-3) = struct('all',[],'one',[],'zero',[],'failure',[],'surv',[]);

for i = 1:p-3
    [var(i).all(:,1)] = data_log(1:n,2); % time
    [var(i).all(:,2)] = data_log(1:n,3); % status
    [var(i).all(:,3)] = data_log(1:n,i+3); % variable (0 or 1)

    [var(i).one(:,1)] = data_log(1:n,2); % time
    [var(i).one(:,2)] = data_log(1:n,3); % status
    [var(i).one(:,3)] = data_log(1:n,i+3); % variable (only 1)
    var(i).one(find([var(i).one(1:end-1,3)]==0),:) = []; % completing var(i).one (eliminating 0s)
    var_one_sum(i,1) = sum(var(i).one(:,1)); % cumulative time
    var_one_sum(i,2) = sum(var(i).one(:,3)); % cumulative ones

    [var(i).zero(:,1)] = data_log(1:n,2); % time
    [var(i).zero(:,2)] = data_log(1:n,3); % status
    [var(i).zero(:,3)] = data_log(1:n,i+3); % variable (only 0)
    var(i).zero(find([var(i).zero(1:end-1,3)]==1),:) = []; % completing var(i).zero (eliminating 1s)
    var_zero_sum(i,1) = sum(var(i).zero(:,1)); % cumulative time
    var_zero_sum(i,2) = sum(var(i).zero(:,3)); % cumulative zeros

    % [var(i).failure(:,1),var(i).failure(:,2)] = ecdf(var(i).one(:,1));
    [var(i).failure(:,1),var(i).failure(:,2)] = ecdf(var(i).one(:,1),'censoring',var(i).one(:,2));
    var(i).surv(:,1) = 1 - var(i).failure(:,1); % survival function for single variable #i
end
% data_var_num = data_var;
data_var_num = data_fuzzy;
for i = 1:p-3
    data_var_num(n+1,i) = i;
end

%% model validation - comparison of covariate correlations in the original and the transformed dataset

corrcoef_raw = corrcoef(data_raw); % ++++++++++++++++++++++++++++++++++++++
corrcoef_var = corrcoef(data_var); % ++++++++++++++++++++++++++++++++++++++

% corrplot(data_raw) % ++++++++++++++++++++++++++++++++++++++++++++++++++++
% corrplot(data_var) % ++++++++++++++++++++++++++++++++++++++++++++++++++++

x_cor = [];
y_cor = [];
z_cor = [];

for i = 1:p-3
    scatter(corrcoef_raw(i,i+1:end),corrcoef_var(i,i+1:end),'k');
    x_cor = [x_cor, corrcoef_raw(i,i+1:end)];
    y_cor = [y_cor, corrcoef_var(i,i+1:end)];
    hold on; grid on;
end
[p_cor,s_cor] = polyfit(x_cor,y_cor,1); % poly fit
z_cor = polyval(p_cor,x_cor);
plot(x_cor,z_cor,'--','Color',[0.5 0.5 0.5]);
hold on; grid on;
xlabel('original dataset','FontSize',15); ylabel('transformed dataset','FontSize',15); hold on
axis([-.6 .6 -.6 .6]);
axis square;

x_axis = [1:p-3];
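The validation step fits a degree-1 polynomial to the paired correlation coefficients, i.e. an ordinary least-squares straight line. For reference, what `polyfit(x,y,1)` computes can be written out in a few lines of pure Python (an illustrative sketch; the function name is mine):

```python
def fit_line(xs, ys):
    """Least-squares line y = a*x + b through the points (xs, ys),
    via the closed-form slope Sxy/Sxx and intercept ybar - a*xbar."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx
```

A slope near 1 with intercept near 0 indicates that the transformed dataset preserves the covariate correlation structure of the original.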
%% ========================================================================
%% ========================================================================
%% ========================================================================
%% ========================================================================
%% ---------- parts -------------------------------------------------------
% =========================================================================
%% ========================================================================
%% ---------- part 01: sda (survival data analysis) -----------------------
% =========================================================================
%%
if part_code(01) == 1

    % http://www.mathworks.com/help/stats/kaplan-meier-methods.html
    % http://www.mathworks.com/help/stats/examples/analyzing-survival-or-reliability-data.html

    % The Wilcoxon rank sum test is equivalent to the Mann-Whitney U test.

    % [P,H] = ranksum(...) returns the result of the hypothesis test, performed at the 0.05 significance level, in H.
    % H = 0 indicates that the null hypothesis ("medians are equal") cannot be rejected at the 5% level.
    % H = 1 indicates that the null hypothesis can be rejected at the 5% level.

    % y = data_unc(:,2)'; % or data_time'
    y_data = data_cens(:,2)'; % or data_time'
    % freqy = data_unc(:,3)';
    censy = data_cens(:,3)';
    freqy = censy;
    % [f,x] = ecdf(y,'frequency',freqy);
    % [f_data,x_data] = ecdf(y_data,'frequency',freqy,'function','survivor');
    [f_data,x_data] = ecdf(y_data,'censoring',censy,'function','survivor');
    data_surv = f_data;
    % data_time = f_data;

    figure;
    % ecdf(y_data,'frequency',freqy,'function','survivor'); % kaplan-meier plot
    ecdf(y_data,'censoring',censy,'function','survivor'); % kaplan-meier plot
    hold on; grid on;
    title('kaplan-meier survival function','FontSize',15);
    xlabel('time, t','FontSize',15); ylabel('survival probability, s(t)','FontSize',15);
    % plot(x,s,'color','r'); hold on; grid on; % survival plot
    h = get(gca,'children');
    set(h,'color','k');

    % ......................................................................
    % uncensored data
    y_unc = data_unc(:,2)'; % or data_time'
    freqy_unc = data_cens(:,3)';
    [f_unc,x_unc] = ecdf(y_data,'frequency',freqy_unc,'function','survivor');
    % data_surv = f_unc;

    % figure;
    ecdf(y_data,'frequency',freqy_unc,'function','survivor'); % kaplan-meier plot
    hold on; grid on;
    title('kaplan-meier survival function (uncensored)','FontSize',15);
    xlabel('time, t','FontSize',15); ylabel('survival probability, s(t)','FontSize',15);
    % plot(x,s,'color','r'); hold on; grid on; % survival plot
    % h = get(gca,'children');
    % set(h,'color','g');
    % ......................................................................

    [p_data,s_data] = polyfit(x_data,data_surv,1); % poly fit
    z = polyval(p_data,x_data);
    % plot(x,s,'o',x,z,'-'); hold on; grid on;
    plot(x_data,z,'-','Color','m');
    hold on; grid on;
    axis([0 data_cal(n,2) 0 1]);

    slope = [];

    figure;
    for i = 1:p-3
        yi = var(i).one(:,1)'; % time
        censyi = var(i).one(:,2)'; % status
        [fi,xi] = ecdf(yi,'censoring',censyi,'function','survivor');
        data_surv_i = fi; % survival function for single variable #i, equal to var(i).surv(:,1)

        ecdf(yi,'censoring',censyi,'function','survivor'); % kaplan-meier plot
        hold on; grid on;
        title('kaplan-meier survival function','FontSize',15);
        xlabel('time, t','FontSize',15); ylabel('survival probability, s(t)','FontSize',15); hold on;
        % plot(x,data_surv,'Color','k'); hold on; % survival plot

        [pi,si] = polyfit(xi,data_surv_i,1); % poly fit
        z = polyval(pi,xi);
        slope(i,1) = p_data(1,1) - pi(1,1);
        plot(xi,z,'-','Color','g');
        hold on; grid on;
        % [p_rank(i),h_rank(i)] = ranksum(data_time,var(i).one(:,1)); % wilcoxon rank sum test
        [p_rank(i),h_rank(i)] = ranksum(data_surv,data_surv_i); % wilcoxon rank sum test
    end

    axis([0 data_cal(n,2) 0 1]);
    xlabel('time, t','FontSize',15); ylabel('survival probability, s(t)','FontSize',15);

    figure;
    bar(x_axis,h_rank); % 0 or 1
    hold on; grid off;
    title('preliminary variable importance - wilcoxon rank test','FontSize',15);
    xlabel('variable','FontSize',15); ylabel('importance','FontSize',15); hold on;

    figure;
    bar(x_axis,1-p_rank,'g'); % 0 to 1
    hold on; grid on;
    title('preliminary variable importance - wilcoxon rank test','FontSize',15);
    xlabel('variable','FontSize',15); ylabel('importance','FontSize',15); hold on;
    % axis([0 17 -1 2])

    figure;
    bar(x_axis,slope,'r'); % plot(slope,'r')
    hold on; grid on;
    title('slope of the survival function','FontSize',15);
    xlabel('variable','FontSize',15); ylabel('slope','FontSize',15); hold on;

end % if part_code(01) == 1
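Part 01 obtains Kaplan-Meier curves through ecdf with the 'censoring' and 'function','survivor' options. For readers without MATLAB, the estimator itself is straightforward: at each observed failure time t, the survival probability is multiplied by (1 - d/n), where d is the number of failures at t and n the number still at risk. A minimal pure-Python sketch (my own, not the toolbox implementation):

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival curve from (time, event) pairs, where
    event = 1 marks a failure and event = 0 a right-censored
    observation. Returns (time, S(t)) points at each failure time."""
    pairs = sorted(zip(times, events))
    n_at_risk = len(pairs)
    surv, curve, i = 1.0, [], 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = at_t = 0
        while i < len(pairs) and pairs[i][0] == t:  # group ties at t
            deaths += pairs[i][1]
            at_t += 1
            i += 1
        if deaths:
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= at_t  # censored and failed both leave the risk set
    return curve
```

Censored observations never produce a step; they only shrink the risk set, which is exactly the distinction the 'censoring' vector communicates to ecdf.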
%% ========================================================================
%% ---------- part 02: svd (singular value decomposition) -----------------
% =========================================================================
%%
if part_code(02) == 1

    % [U,S,V] = svd(X) produces a diagonal matrix S of the same dimension as X, with nonnegative diagonal elements in decreasing order, and unitary matrices U and V so that X = U*S*V'.
    % [U,S,V] = svd(X,0) produces the "economy size" decomposition. If X is m-by-n with m > n, then svd computes only the first n columns of U and S is n-by-n.
    % [U,S,V] = svd(X,'econ') also produces the "economy size" decomposition. If X is m-by-n with m >= n, it is equivalent to svd(X,0). For m < n, only the first m columns of V are computed and S is m-by-m.
    % http://www.mathworks.com/help/matlab/ref/svd.html

    [U1,S1,V1] = svd(data_raw); % svd : data_raw = U*S*V'
    [U2,S2,V2] = svd(data_var); % svd : data_var = U*S*V'

    figure;
    bar(x_axis,diag(S1));
    hold on; grid on;
    title('singular value decomposition','FontSize',15);
    xlabel('variable','FontSize',15); ylabel('eigenvalue','FontSize',15);
    set(gca,'FontSize',15);

    figure;
    pareto(diag(S1));
    hold on; grid on;
    title('singular value decomposition','FontSize',15);
    xlabel('variable','FontSize',15); ylabel('eigenvalue','FontSize',15);

    figure;
    bar(x_axis,diag(S2));
    hold on; grid on;
    title('singular value decomposition','FontSize',15);
    xlabel('variable','FontSize',15); ylabel('eigenvalue','FontSize',15);

    figure;
    pareto(diag(S2));
    hold on; grid on;
    title('singular value decomposition','FontSize',15);
    xlabel('variable','FontSize',15); ylabel('eigenvalue','FontSize',15);

    figure;
    bar(x_axis,[diag(S1) diag(S2)],'stack');
    hold on; grid on;
    title('singular value decomposition - raw - log','FontSize',15); hold on;
    xlabel('criterion 1','FontSize',15); ylabel('criterion 2','FontSize',15); hold on;
    legend('criterion 1','criterion 2');
    set(gca,'FontSize',15);

    figure;
    bar(x_axis,[diag(S1) diag(S2)],0.5,'grouped');
    hold on; grid on;
    title('singular value decomposition - raw - log','FontSize',15); hold on;
    xlabel('criterion 1','FontSize',15); ylabel('criterion 2','FontSize',15); hold on;
    legend('criterion 1','criterion 2');
    set(gca,'FontSize',15);

    % rank-2 truncation: zero all but the two leading components
    U3 = U1; U3(:,3:end) = 0;
    S3 = S1; S3(:,3:17) = 0;
    V3 = V1; V3(3:17,:) = 0;
    DATA_SVD = U3*S3*V3';

end % if part_code(02) == 1
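Part 02 zeroes all but the two leading components to form a low-rank reconstruction of the data matrix. The key ingredient is the leading singular triple, which can be found without a full SVD by power iteration on A'A. The following pure-Python sketch (my own, suitable only for small dense matrices) illustrates the idea:

```python
def top_singular(A, iters=200):
    """Leading singular value and vectors of a small matrix by power
    iteration on A'A; sigma1*u1*v1' is the best rank-1 approximation."""
    m, n = len(A), len(A[0])
    v = [1.0] * n
    for _ in range(iters):
        # one power step: w = A v, then v = A' w, renormalized
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]
        v = [sum(A[i][j] * w[i] for i in range(m)) for j in range(n)]
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]
    sigma = sum(x * x for x in w) ** 0.5
    u = [x / sigma for x in w]
    return sigma, u, v
```

Deflating (subtracting sigma*u*v' and repeating) yields the second component, giving the same rank-2 reconstruction the MATLAB code builds from the full svd output.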
%% ========================================================================
%% ---------- part 03: pca (principal component analysis) -----------------
% =========================================================================
%%
if part_code(03) == 1
% [coeff,score,latent,tsquared,explained,mu] = pca(X,Name,Value) returns
% the principal component coefficients for the n-by-p data matrix X.
% Rows of X correspond to observations and columns to variables.
% The coefficient matrix is p-by-p. Each column of coeff contains the
% coefficients for one principal component, and the columns are in
% descending order of component variance. By default, pca centers the
% data and uses the singular value decomposition algorithm.
% It accepts additional options for computation and handling of special
% data types, specified by one or more Name,Value pair arguments, and
% returns the principal component scores in score and the principal
% component variances in latent. Principal component scores are the
% representations of X in the principal component space. Rows of score
% correspond to observations, and columns correspond to components.
% The principal component variances are the eigenvalues of the
% covariance matrix of X.
% It can also return Hotelling's T-squared statistic for each
% observation in X, explained (the percentage of the total variance
% explained by each principal component), and mu (the estimated mean of
% each variable in X).
% http://www.mathworks.com/help/stats/pca.html
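As the notes above describe, pca can work either from the SVD of the centered data or from the eigen-decomposition of its covariance matrix (the pcacov route), and the two agree. A small NumPy sketch of that equivalence on hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
Xc = X - X.mean(axis=0)                   # pca centers the data by default

# Eigen-decomposition of the covariance matrix (what pcacov works from),
# reordered to descending component variance.
lam, W = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(lam)[::-1]
lam, W = lam[order], W[:, order]

# Same components via SVD of the centered data (pca's default algorithm);
# the eigenvalues relate to the singular values by lam = s**2 / (n - 1).
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
```

This is why the scree plots from pcacov and pca in this part should coincide up to numerical noise.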
[coeff_pcacov_raw,latent_pcacov_raw] = pcacov(cov(data_raw)); % pcacov expects a covariance matrix
[coeff_pcacov_var,latent_pcacov_var] = pcacov(cov(data_var));

for i = 1:2
    figure;
    if i == 1
        eigenvalues_pcacov = latent_pcacov_raw; h = title('scree plot - pcacov - raw'); hold on;
    else
        eigenvalues_pcacov = latent_pcacov_var; h = title('scree plot - pcacov - var'); hold on;
    end
    set(h,'Color','k','FontSize',15); hold on;
    plot(eigenvalues_pcacov,'o-','color','k');
    hold on; grid on;
    xlabel('component number','FontSize',15); ylabel('eigenvalue','FontSize',15);
    % figure;
    % bar(x_axis,eigenvalues_pcacov); hold on; grid on;
    % xlabel('component number','FontSize',15); ylabel('eigenvalue','FontSize',15);
end

[coeff_pca_raw,score_pca_raw,latent_pca_raw] = pca(data_raw,'algorithm','svd');
[coeff_pca_var,score_pca_var,latent_pca_var] = pca(data_var,'algorithm','svd');

for i = 1:2
    figure;
    if i == 1
        eigenvalues_pca = latent_pca_raw; h = title('scree plot - pca - raw'); hold on;
    else
        eigenvalues_pca = latent_pca_var; h = title('scree plot - pca - var'); hold on;
    end
    set(h,'Color','k','FontSize',15); hold on;
    plot(eigenvalues_pca,'o-','color','m');
    hold on; grid on;
    xlabel('component number','FontSize',15); ylabel('eigenvalue','FontSize',15);
    % figure;
    % bar(x_axis,eigenvalues_pca); hold on; grid on;
    % xlabel('component number','FontSize',15); ylabel('eigenvalue','FontSize',15);
end

cov_var = cov(data_var); % renamed from "cov" to avoid shadowing the built-in function
[VV,DD] = eigs(cov_var);
figure;
bar(coeff_pcacov_raw(:,1:2),'grouped');
hold on; grid on;
title('coeff - pcacov - raw - comp 1 and 2','FontSize',15); hold on;
xlabel('variable','FontSize',15); ylabel('coeff','FontSize',15); hold on;
legend('comp 1','comp 2');
set(gca,'FontSize',15);

figure;
bar(coeff_pcacov_var(:,1:2),'grouped');
hold on; grid on;
title('coeff - pcacov - log - comp 1 and 2','FontSize',15); hold on;
xlabel('variable','FontSize',15); ylabel('coeff','FontSize',15); hold on;
legend('comp 1','comp 2');
set(gca,'FontSize',15);

figure;
bar(coeff_pca_raw(:,1:2),'grouped');
hold on; grid on;
title('coeff - pca - raw - comp 1 and 2','FontSize',15); hold on;
xlabel('variable','FontSize',15); ylabel('coeff','FontSize',15); hold on;
legend('comp 1','comp 2');
set(gca,'FontSize',15);

figure;
bar(coeff_pca_var(:,1:2),'grouped');
hold on; grid on;
title('coeff - pca - log - comp 1 and 2','FontSize',15); hold on;
xlabel('variable','FontSize',15); ylabel('coeff','FontSize',15); hold on;
legend('comp 1','comp 2');
set(gca,'FontSize',15);

figure;
bar(1:p-3,[coeff_pca_raw(:,1) coeff_pca_var(:,1)],0.5,'grouped');
hold on; grid on;
title('pca - raw - log - comp 1','FontSize',15); hold on; % title corrected from "singular value decomposition"
xlabel('variable','FontSize',15); ylabel('coeff','FontSize',15); hold on;
legend('raw','log');
set(gca,'FontSize',15);

figure;
bar(1:p-3,[coeff_pca_raw(:,2) coeff_pca_var(:,2)],0.5,'grouped');
hold on; grid on;
title('pca - raw - log - comp 2','FontSize',15); hold on; % title corrected from "singular value decomposition"
xlabel('variable','FontSize',15); ylabel('coeff','FontSize',15); hold on;
legend('raw','log');
set(gca,'FontSize',15);

end % if part_code(03) == 1
%% ========================================================================
%% ---------- part 04: cox phm (cox proportional hazards model) -----------
% =========================================================================
%%
if part_code(04) == 1
% [b,logl,H,stats] = coxphfit(X,T,Name,Value) returns a p-by-1 vector, b,
% of coefficient estimates for a Cox proportional hazards regression of
% the observed responses in an n-by-1 vector, T, on the predictors in an
% n-by-p matrix X.
% The model does not include a constant term, and X cannot contain a
% column of 1s.
% It accepts additional options specified by one or more Name,Value pair
% arguments, and returns the loglikelihood, logl, a structure, stats,
% containing additional statistics, and a two-column matrix, H, that
% contains the T values in the first column and the estimated baseline
% cumulative hazard in the second column.
% http://www.mathworks.com/help/stats/coxphfit.html
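The stairs plots below turn the baseline cumulative hazard H0(t) returned by coxphfit into a baseline survival curve via S0(t) = exp(-H0(t)), which is what the `exp(-H1(:,2))` expressions compute. A tiny NumPy illustration on hypothetical values standing in for the second column of H:

```python
import numpy as np

# Hypothetical baseline cumulative hazard at a few event times,
# standing in for H(:,2) from coxphfit (nondecreasing by construction).
t  = np.array([1.0, 2.0, 4.0, 7.0])
H0 = np.array([0.05, 0.12, 0.30, 0.55])

S0 = np.exp(-H0)                 # baseline survival, as in exp(-H1(:,2))

# Under proportional hazards, covariates x scale the hazard by the
# relative risk exp(b'x), so S(t|x) = S0(t) ** exp(b @ x).
b = np.array([0.8, -0.3])        # hypothetical coefficient vector
x = np.array([1.0, 2.0])         # hypothetical covariate values
S_x = S0 ** np.exp(b @ x)
```

The comparison against the Kaplan-Meier curve (the `ecdf(...,'function','survivor')` calls) checks how well this parametric baseline tracks the nonparametric estimate.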
[b1,logL1,H1,st1] = coxphfit(data_raw,data_time);

figure;
stairs(H1(:,1),exp(-H1(:,2)),'linestyle','-','color','k');
hold on; grid on;

[b2,logL2,H2,st2] = coxphfit(data_nor,data_time);
% figure;
stairs(H2(:,1),exp(-H2(:,2)),'linestyle','--','color','g');
hold on; grid on;

[b3,logL3,H3,st3] = coxphfit(data_var,data_time);
% figure;
stairs(H3(:,1),exp(-H3(:,2)),'linestyle',':','color','r');
hold on; grid on;
% title(sprintf('Baseline survivor function for X=%g',mean(x)));

b4 = abs(b1); % absolute values of cox phm coefficients - raw data

b5 = abs(b3); % absolute values of cox phm coefficients - logical data

ecdf(y_data,'censoring',censy,'function','survivor','bounds','on'); hold on; % kaplan-meier plot (blue graph)

title('cox phm baseline survival function','FontSize',15); hold on;
xlabel('time, t','FontSize',15); ylabel('survival probability, s(t)','FontSize',15); hold on;
legend('raw data','standardized data','logical data','km','Location','NorthEastOutside'); hold on;

figure;
ecdf(y_data,'censoring',censy,'function','survivor','bounds','off'); hold on; % kaplan-meier plot
stairs(H1(:,1),exp(-H1(:,2)),'linestyle','--','color','k'); hold on; grid on;
stairs(H3(:,1),exp(-H3(:,2)),'linestyle',':','color','r'); hold on; grid on;
xx = linspace(0,100);
xlabel('time, t','FontSize',15); ylabel('survival probability, s(t)','FontSize',15); hold on;
% title('cox phm baseline survival function','FontSize',15); hold on;

[p_h1h3,h_h1h3] = ranksum(H1(:,2),H3(:,2)); % wilcoxon rank sum test
logrank_h1h3 = logrank(H1(:,2),H3(:,2)); % log-rank test

figure;
bar(x_axis,h_rank,'y'); hold on; grid on;
title('slope of the survival function','FontSize',15); hold on;
xlabel('variable','FontSize',15); ylabel('slope','FontSize',15); hold on;

bar(x_axis,b2); hold on; grid on;
title('slope of the survival function','FontSize',15); hold on;
xlabel('variable','FontSize',15); ylabel('slope','FontSize',15); hold on;

figure;
data_surv(1) = subplot(3,1,1);
plot(b1,'k'); hold on; grid on; % raw data
title('cox phm coefficients','FontSize',15); hold on;
xlabel('variable','FontSize',15); ylabel('coefficient','FontSize',15); hold on;
data_surv(2) = subplot(3,1,2);
plot(b2,'b'); hold on; grid on; % standardized data
title('cox phm coefficients','FontSize',15); hold on;
xlabel('variable','FontSize',15); ylabel('coefficient','FontSize',15); hold on;
data_surv(3) = subplot(3,1,3);
plot(b3,'r'); hold on; grid on; % logical data
title('cox phm coefficients','FontSize',15); hold on;
xlabel('variable','FontSize',15); ylabel('coefficient','FontSize',15); hold on;

figure;
plot(b1,'linestyle','-','color','k'); hold on; grid on; % raw data
plot(b2,'linestyle','--','color','b'); hold on; grid on; % standardized data
plot(b3,'linestyle','-.','color','r'); hold on; grid on; % logical data
plot(b4,'linestyle','-.','color','g'); hold on; grid on; % absolute values, raw data (b4 = abs(b1))
title('cox phm coefficients','FontSize',15); hold on;
xlabel('variable','FontSize',15); ylabel('coefficient','FontSize',15); hold on;
legend('raw','standardized','logical','raw absolute','Location','NorthEastOutside'); hold on;
% plot(b2,'--','color',[0.1 0.1 0.1])
% set(gca,'FontSize',10); hold on

figure;
plot(b1); hold on; grid on; % raw data
title('cox phm coefficients - raw data','FontSize',15); hold on;
xlabel('variable','FontSize',15); ylabel('coefficient','FontSize',15); hold on;

figure;
plot(b4,'r'); hold on; grid on; % raw data (absolute values)
title('cox phm coefficients - raw data (absolute value)','FontSize',15); hold on;
xlabel('variable','FontSize',15); ylabel('coefficient','FontSize',15); hold on;

end % if part_code(04) == 1
%% ========================================================================
%% ---------- part 05: mlr (multiple linear regression model) -------------
% =========================================================================
%%
if part_code(05) == 1
% [b,bint,r,rint,stats] = regress(y,X) multiple linear regression using
% least squares
% http://www.mathworks.com/help/stats/regress.html

% lm = fitlm(X,y) linear model
% rlm = fitlm(X,y,'linear','RobustOpts','on') robust linear model
% mdl = LinearModel.fit(X,y) creates a linear model of the responses y
% to a data matrix X

% http://www.mathworks.com/help/stats/linearmodel.fit.html
% http://www.mathworks.com/help/stats/linear-regression-model-workflow.html
% [b,bint,r,rint,stats] = regress(data_unc(1:n,2),data_unc(1:n,4:p));
[b,bint,r,rint,stats] = regress(data_time,data_var);
b_fu = regress(data_time,data_fuzzy);

% lm_unc = fitlm(data_unc(1:n,4:p),data_unc(1:n,2));
% rlm_unc = fitlm(data_unc(1:n,4:p),data_unc(1:n,2),'linear','RobustOpts','on');

lm_unc = fitlm(data_cens(1:n,4:p),data_cens(1:n,2));
rlm_unc = fitlm(data_cens(1:n,4:p),data_cens(1:n,2),'linear','RobustOpts','on');
lm_var = fitlm(data_var,data_log(1:n,2));

figure;
bar(x_axis,abs(b),'k'); hold on; grid on; % plot(b); hold on; grid on;
title('multiple linear regression model','FontSize',15); hold on;
xlabel('variable','FontSize',15); ylabel('y','FontSize',15); hold on;

figure;
bar(x_axis,abs(b_fu),'b'); hold on; grid on; % plot(b); hold on; grid on;
title('fuzzy multiple linear regression model','FontSize',15); hold on;
xlabel('variable','FontSize',15); ylabel('y','FontSize',15); hold on;

% figure;
% plot(lm_unc); hold on; grid on;
% figure;
% plot(lm_var); hold on; grid on;

figure;
plot(b,'k');
hold on; grid on;
title('cumulation of all variable scores - randomization','FontSize',15); hold on;
xlabel('variable','FontSize',15); ylabel('y','FontSize',15); hold on;

plot(lm_unc.Coefficients.Estimate(2:end),'b'); % coefficient estimates, excluding the intercept
hold on; grid on;
title('cumulation of all variable scores - randomization','FontSize',15); hold on;
xlabel('variable','FontSize',15); ylabel('y','FontSize',15); hold on;

plot(rlm_unc.Coefficients.Estimate(2:end),'r');
hold on; grid on;
title('cumulation of all variable scores - randomization','FontSize',15); hold on;
xlabel('variable','FontSize',15); ylabel('y','FontSize',15); hold on;

plot(lm_var.Coefficients.Estimate(2:end),'m');
hold on; grid on;
title('cumulation of all variable scores - randomization','FontSize',15); hold on;
xlabel('variable','FontSize',15); ylabel('y','FontSize',15); hold on;

end % if part_code(05) == 1
%% ========================================================================
%% ---------- part 06: k-means (clustering) -------------------------------
% =========================================================================
%%
if part_code(06) == 1
% [IDX,C,sumd,D] = kmeans(X,k) partitions the points in the n-by-p data
% matrix X into k clusters, returns the k cluster centroid locations in
% the k-by-p matrix C, the within-cluster sums of point-to-centroid
% distances in the 1-by-k vector sumd, and the distances from each point
% to every centroid in the n-by-k matrix D.
% This iterative partitioning minimizes the sum, over all clusters, of
% the within-cluster sums of point-to-cluster-centroid distances. Rows
% of X correspond to points, columns correspond to variables.
% kmeans returns an n-by-1 vector IDX containing the cluster indices of
% each point. By default, kmeans uses squared Euclidean distances. When
% X is a vector, kmeans treats it as an n-by-1 data matrix, regardless
% of its orientation.
% http://www.mathworks.com/help/stats/kmeans.html
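The assign/update iteration that kmeans performs (Lloyd's algorithm) can be sketched in a few lines of NumPy; the two well-separated blobs here are hypothetical data, not the dissertation's dataset:

```python
import numpy as np

def kmeans_lloyd(X, k, iters=50, seed=0):
    """Minimal Lloyd's algorithm: the assign/update loop behind kmeans,
    reducing the sum of within-cluster point-to-centroid distances."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)]   # initial centroids
    for _ in range(iters):
        # assign: each point to its nearest centroid
        D = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
        idx = D.argmin(axis=1)
        # update: move each centroid to the mean of its assigned points
        C = np.array([X[idx == j].mean(axis=0) if np.any(idx == j) else C[j]
                      for j in range(k)])
    return idx, C

# two tight, well-separated blobs; k = 2 should recover them
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)),
               rng.normal(5.0, 0.1, (20, 2))])
idx, C = kmeans_lloyd(X, 2)
```

The MATLAB calls below add two refinements this sketch omits: the cityblock distance and five replicates with random restarts, keeping the best objective.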
k = 3;

opts = statset('Display','final');
[cidx_raw,ctrs_raw] = kmeans(data_raw,fix((p-3)/k),'Distance','cityblock','Replicates',5,'Options',opts); % output renamed from ctrs_log to ctrs_raw

opts = statset('Display','final');
[cidx_log,ctrs_log] = kmeans(data_var,fix((p-3)/k),'Distance','cityblock','Replicates',5,'Options',opts);

figure;
% plot(data_var(cidx_log==1,1),data_var(cidx_log==1,2),'r.',data_var(cidx_log==2,1),data_var(cidx_log==2,2),'b.',ctrs_log(:,1),ctrs_log(:,2),'kx');
scatter(10*(cidx_raw-1)+(1:n)',10*(cidx_log-1)+(1:n)');
hold on; grid on;
title('k-means clustering scatter','FontSize',15); hold on;
xlabel('variable','FontSize',15); ylabel('cluster','FontSize',15); hold on;

figure;
plot([cidx_raw cidx_log]);
hold on; grid on;
title('k-means clustering - raw - log','FontSize',15); hold on;
xlabel('variable','FontSize',15); ylabel('cluster','FontSize',15); hold on;

% ---------------------------------------------------------------------

data_raw_kmean = data_raw';
data_log_kmean = data_var';

opts = statset('Display','final');
[cidx_raw_kmean,ctrs_raw_kmean] = kmeans(data_raw',fix((p-3)/k),'Distance','cityblock','Replicates',5,'Options',opts);

opts = statset('Display','final');
[cidx_log_kmean,ctrs_log_kmean] = kmeans(data_var',fix((p-3)/k),'Distance','cityblock','Replicates',5,'Options',opts);

figure;
bar(x_axis,cidx_raw_kmean,'k');
hold on; grid on;
title('variables k-means clustering - raw','FontSize',15); hold on;
xlabel('variable','FontSize',15); ylabel('cluster','FontSize',15); hold on;

figure;
bar(x_axis,cidx_log_kmean,'m');
hold on; grid on;
title('variables k-means clustering - log','FontSize',15); hold on;
xlabel('variable','FontSize',15); ylabel('cluster','FontSize',15); hold on;

end % if part_code(06) == 1
%% ========================================================================
%% ---------- part 07: rf (random forest) ---------------------------------
% =========================================================================
%%
if part_code(07) == 1
% B = TreeBagger(NTrees,X,Y) creates an ensemble B of NTrees decision
% trees for predicting response Y as a function of predictors X.
% By default, TreeBagger builds an ensemble of classification trees.
% The function can build an ensemble of regression trees by setting the
% optional input argument 'method' to 'regression'.
% http://www.mathworks.com/help/stats/treebagger.html
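The out-of-bag error that `oobError(b)` plots below works because each tree is trained on a bootstrap sample, leaving roughly a third of the observations out as an honest test set for that tree. A hedged NumPy sketch of just the bootstrap/out-of-bag mechanics (the sizes are arbitrary):

```python
import numpy as np

# Each tree in a bagged ensemble sees a bootstrap sample of size n drawn
# with replacement; the observations it never sees are "out-of-bag" and
# can score that tree without a separate validation set.
rng = np.random.default_rng(0)
n, trees = 500, 200

oob_frac = np.mean([
    len(np.setdiff1d(np.arange(n), rng.integers(0, n, n))) / n
    for _ in range(trees)
])
# Expected out-of-bag fraction is (1 - 1/n)**n, close to exp(-1) ~ 0.368.
```

Averaging each observation's out-of-bag predictions over all trees that omitted it is what produces the error curve against the number of grown trees.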
b = TreeBagger(100,data_var,data_log(1:n,2),'oobpred','on');

figure;
plot(oobError(b)); hold on; grid on;
title('random forest','FontSize',15); hold on;
xlabel('number of grown trees'); ylabel('out-of-bag classification error');

% stochastic_bosque(data_var,data_log(1:n,2))

end % if part_code(07) == 1
%% ========================================================================
%% ---------- part 08: on and off comparison (doe approach) ---------------
% =========================================================================
%%
if part_code(08) == 1
for i = 1:p-3
    [p_rank_one(i),h_rank_one(i)] = ranksum(data_surv,var(i).one(:,1)); % wilcoxon rank sum test
    [p_rank_zero(i),h_rank_zero(i)] = ranksum(data_surv,var(i).zero(:,1)); % wilcoxon rank sum test
    delta_wilcoxon_time(i,1) = p_rank_one(i) - p_rank_zero(i); % renamed: a literal ∆ is not a valid MATLAB identifier
end

figure;
bar(x_axis,var_one_sum,'y','grouped');
hold on; grid off;
title('one by one variable time sum','FontSize',15); hold on;
xlabel('variable','FontSize',15); ylabel('y','FontSize',15); hold on;

figure;
bar(x_axis,delta_wilcoxon_time,'r','grouped');
hold on; grid off;
title('\Delta - wilcoxon rank test','FontSize',15);
xlabel('variable','FontSize',15); ylabel('importance','FontSize',15); hold on;
% axis([0 17 -1 2])

end % if part_code(08) == 1
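Part 08's delta score is built from ranksum p-values computed with each variable switched on versus off. The Mann-Whitney U statistic underlying the Wilcoxon rank sum test is simple enough to compute directly; a small sketch on toy samples:

```python
import numpy as np

def rank_sum_U(x, y):
    """Mann-Whitney U statistic behind ranksum: the number of pairs
    (xi, yj) with xi > yj, counting ties as 1/2."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    greater = (x[:, None] > y[None, :]).sum()
    ties = (x[:, None] == y[None, :]).sum()
    return greater + 0.5 * ties

x = [1.2, 3.4, 5.6]
y = [0.5, 2.0]
U = rank_sum_U(x, y)   # 5 of the 6 (xi, yj) pairs have xi > yj
```

ranksum converts this statistic into the p-value used above; a U far from its null expectation of len(x)*len(y)/2 signals that one sample tends to dominate the other.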
%% ========================================================================
%% ---------- part 09: test score - brute-force (exhaustive) search -------
% =========================================================================
%%
if part_code(09) == 1
k = 3; % no. of variables in each combination out of all variables
comb = [];
comb = nchoosek(data_var_num(end,:),k); % data_var_num(end,:) is the vector of variable numbers
[comb_row,comb_col] = size(comb);

% y = data_unc(:,2)'; % (or data_time') failure time row
% freqy = data_unc(:,3)';

% from part 1 ---------------------------------------------------------

% [f,x] = ecdf(y,'censoring',freqy);
% s = 1-f;
% figure;
% ecdf(y,'censoring',freqy,'function','survivor'); % kaplan-meier plot
% hold on; grid on;
% title('kaplan-meier survival function','FontSize',15);
% xlabel('time, t','FontSize',15); ylabel('survival probability, s(t)','FontSize',15);
% % plot(x,s,'color','r'); hold on; grid on; % survival plot
% % h = get(gca,'children');
% % set(h,'color','k');

% [P0,S0] = polyfit(x,s,1); % poly fit
% z = polyval(P0,x);
% % plot(x,s,'o',x,z,'-'); hold on; grid on;
% plot(x,z,'-','Color','r');
% hold on; grid on;
% axis([0 data_cal(n,2) 0 1]);

% slope = [];

% end of from part 1 --------------------------------------------------

% figure;
sum_limit_effect_var = 0;

for i = 1:comb_row

    data_comb = []; % constructing survival data from set #i
    data_comb = data_var(:,comb(i,1:k)); % for set #i
    data_comb(:,k+1) = sum(data_comb,2);
    % data_comb(:,k+2) = sign(data_comb(:,k+1))
    % data_comb(:,k+2) = floor(data_comb(:,k+1)/k);

    data_comb(1:n,k+2) = data_time; % time
    % data_comb(:,k+4) = data_comb(:,k+2).*data_comb(:,k+3);

    data_comb(1:n,k+3) = data_log(1:n,3); % status

    var_comb = []; % constructing survival data from set #i based on element k+1 (for time t, select if at least "sum_limit_effect_var" variables are at 1)
    % var_comb = data_comb(:,k+4);
    % var_comb(find([var_comb(:,1)]==0),:) = []; % remove ineffective failure times from set #i of all combinations

    var_comb(:,1:2) = data_comb(find(data_comb(:,k+1)>sum_limit_effect_var),k+2:k+3); % remove ineffective failure times from set #i of all combinations

    if ~isempty(var_comb)

        var_comb_surv = [];
        [var_comb_surv(:,1),var_comb_surv(:,2)] = ecdf(var_comb(:,1),'censoring',var_comb(:,2),'function','survivor'); % kaplan-meier plot

        [comb(i,k+1),comb(i,k+2)] = ranksum(data_time,var_comb(:,1)); % wilcoxon rank sum test (col k+1 and k+2) for time
        comb(i,k+3) = logrank(data_time,var_comb(:,1)); % logrank test (col k+3) for time

        [comb(i,k+4),comb(i,k+5)] = ranksum(data_surv,var_comb_surv(:,1)); % wilcoxon rank sum test (col k+4 and k+5) for survival
        % comb(i,k+6) = logrank(data_surv,var_comb_surv(:,1)); % logrank test (col k+6) for survival

    end

    % kaplan-meier plot for all =======================================

    plot_km_for_all = 0;

    if plot_km_for_all == 1

        if comb(i,k+2)==0; color_km_all = 'b'; end
        if comb(i,k+2)==1; color_km_all = 'r'; end

        % figure;
        [f_km_all,x_km_all] = ecdf(var_comb(:,1),'function','survivor'); % kaplan-meier plot

        stairs(x_km_all,f_km_all,color_km_all);

        hold on; grid on;
        % title('kaplan-meier survival function','FontSize',15);
        % xlabel('time, t','FontSize',15); ylabel('survival probability, s(t)','FontSize',15);
        % % plot(x_km_all,1-f_km_all,'color','r'); hold on; grid on; % survival plot
        % % h = get(gca,'children');
        % % set(h,'color','k');

        [p_km_all,s_km_all] = polyfit(var_comb_failure(:,2),var_comb_surv(:,1),1); % poly fit
        % z_km_all = polyval(p_km_all,var_comb_failure(:,2));
        % plot(var_comb_failure(:,2),var_comb_surv(:,1),'o',var_comb_failure(:,2),z_km_all,'-'); hold on; grid on;
        % plot(var_comb_failure(:,2),z_km_all,'-','Color','r');
        hold on; grid on;
        axis([0 data_cal(n,2) 0 1]);

        y_km = data_cens(:,2)'; % or data_time'
        freqy_km = data_cens(:,3)';
        [f_km,x_km] = ecdf(y_km,'censoring',freqy_km);
        data_surv_km = 1-f_km;

        % figure;
        ecdf(y_km,'censoring',freqy_km,'function','survivor'); % kaplan-meier plot
        hold on; grid on;
        title('kaplan-meier survival function','FontSize',15);
        xlabel('time, t','FontSize',15); ylabel('survival probability, s(t)','FontSize',15);
        % plot(x,s,'color','r'); hold on; grid on; % survival plot
        h = get(gca,'children');
        set(h,'color','k');

        [p_km,s_km] = polyfit(x_km,data_surv_km,1); % poly fit
        z_km = polyval(p_km,x_km);
        % plot(x,s,'o',x,z,'-'); hold on; grid on;
        plot(x_km,z_km,'m:');
        hold on; grid on;
        axis([0 data_cal(n,2) 0 1]);

    end

    % =================================================================

end
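The loop above visits every k-subset produced by nchoosek and later accumulates its test score into each member variable. A hedged NumPy/itertools sketch of that enumerate-and-cumulate pattern, with random numbers standing in for the rank-sum and logrank results:

```python
import numpy as np
from itertools import combinations

# Brute-force enumeration of all k-subsets of p variables, as nchoosek
# does above. The score per subset is a hypothetical stand-in for the
# rank-sum / logrank statistics stored in comb(i, k+1:k+5).
p, k = 6, 3
rng = np.random.default_rng(0)

cum_score = np.zeros(p)   # per-variable cumulative score
total = 0.0
n_sets = 0
for subset in combinations(range(p), k):
    score = rng.random()              # hypothetical test score for this set
    for v in subset:
        cum_score[v] += score         # cumulate into each member variable
    total += score
    n_sets += 1
```

Because every subset contributes its score to exactly k variables, the per-variable totals sum to k times the overall score total, which is a quick sanity check on the accumulation loop.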
data_nptest = [];
data_nptest_cum = [];
data_nptest_cum_unique = [];
data_nptest_elem_freq = [];
data_nptest_elem = [];

data_nptest = comb;

% sortcomb = sortrows(comb,4)

% ---------- keep rejected sets (positive approach) -------------------

% k+2 or k+5
data_nptest(find([data_nptest(:,k+2)]==0),:) = []; % keep rejected sets (worst variables): eliminate sets with wilcoxon rank = 0 and keep wilcoxon rank = 1 (keep wilcoxon rank < 0.05)
% data_nptest(find([data_nptest(:,k+1)]>0.05),:) = []; % keep rejected sets: eliminate sets with wilcoxon rank > 0.05 and keep wilcoxon rank < 0.05

% data_nptest(find([data_nptest(:,k+3)]>0.05),:) = []; % keep rejected sets: eliminate sets with logrank > 0.05 and keep logrank < 0.05

% ---------- eliminate rejected sets (negative approach) --------------

% data_nptest(find([data_nptest(:,k+2)]==1),:) = []; % eliminate rejected sets: eliminate sets with wilcoxon rank = 1 and keep wilcoxon rank = 0 (keep wilcoxon rank > 0.05)
% data_nptest(find([data_nptest(:,k+1)]<0.05),:) = []; % eliminate rejected sets: eliminate sets with wilcoxon rank < 0.05 and keep wilcoxon rank > 0.05

% data_nptest(find([data_nptest(:,k+3)]<0.05),:) = []; % eliminate rejected sets: eliminate sets with logrank < 0.05 and keep logrank > 0.05

% ---------------------------------------------------------------------

data_nptest = data_nptest(:,1:k);
[data_row,data_col] = size(data_nptest);

for i = 1:data_col % put all selected variable numbers in a single vector
    data_nptest_cum((i-1)*data_row+1:i*data_row,1) = data_nptest(:,i);
end

data_nptest_cum_unique = unique(data_nptest_cum); % elements in data_nptest_cum
[data_nptest_elem_freq,data_nptest_elem] = hist(data_nptest_cum,data_nptest_cum_unique); % elements in data_nptest_cum and their frequencies

figure;
bar(data_nptest_cum_unique,histc(data_nptest_cum,data_nptest_cum_unique)); % elements in data_nptest_cum and their frequencies - method 1
hold on; grid on;
title('wilcoxon rank or logrank frequencies (method 1)','FontSize',15); hold on;
xlabel('variable','FontSize',15); ylabel('censoring','FontSize',15); hold on;

figure;
bar(data_nptest_elem,data_nptest_elem_freq); % elements in data_nptest_cum and their frequencies - method 2
hold on; grid on;
title('wilcoxon rank or logrank frequencies (method 2)','FontSize',15); hold on;
xlabel('variable','FontSize',15); ylabel('censoring','FontSize',15); hold on;

data_nptest_elem_comb = data_nptest_elem; % for overall plot at the end of part 10
data_nptest_elem_freq_comb = 9*((data_nptest_elem_freq - min(data_nptest_elem_freq))/(max(data_nptest_elem_freq) - min(data_nptest_elem_freq))) + 1; % 1-to-10 normalized score for the overall plot at the end of part 10

% ---------- cumulative test scores for all variables -----------------

data_comb_nptest_cum = [];

for i = 1:p-3
    data_comb_nptest_cum(i,1) = i;
    data_comb_nptest_cum(i,2) = 0;
    data_comb_nptest_cum(i,3) = 0;
end

for i = 1:comb_row
    for j = 1:comb_col % = k
        data_comb_nptest_cum(comb(i,j),2) = data_comb_nptest_cum(comb(i,j),2) + comb(i,k+1); % cumulation of all wilcoxon rank for all variables
        data_comb_nptest_cum(comb(i,j),3) = data_comb_nptest_cum(comb(i,j),3) + comb(i,k+3); % cumulation of all logrank for all variables
    end
end

figure;
bar(data_comb_nptest_cum(:,2),'r','grouped');
hold on; grid off;
title('cumulation of all wilcoxon rank','FontSize',15); hold on;
xlabel('variable','FontSize',15); ylabel('y','FontSize',15); hold on;

figure;
bar(data_comb_nptest_cum(:,3),'b','stacked');
hold on; grid off;
title('cumulation of all logrank','FontSize',15); hold on;
xlabel('variable','FontSize',15); ylabel('y','FontSize',15); hold on;

figure;
bar(data_comb_nptest_cum(:,2:3),'grouped');
hold on; grid off;
title('cumulation of all wilcoxon rank and logrank','FontSize',15); hold on;
xlabel('variable','FontSize',15); ylabel('y','FontSize',15); hold on;

figure;
bar(data_comb_nptest_cum(:,2:3),'stacked');
hold on; grid off;
title('cumulation of all wilcoxon rank and logrank','FontSize',15); hold on;
xlabel('variable','FontSize',15); ylabel('y','FontSize',15); hold on;

data_comb_nptest_cum_norm = 9*((data_comb_nptest_cum(:,2) - min(data_comb_nptest_cum(:,2)))/(max(data_comb_nptest_cum(:,2)) - min(data_comb_nptest_cum(:,2)))) + 1; % 1-to-10 normalized score for the overall plot at the end of part 10

end % if part_code(09) == 1
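The 1-to-10 scores used for the overall plots are a min-max rescaling, 9*((x - min)/(max - min)) + 1, applied both to the frequency vector and to the cumulative rank-sum scores. A one-function NumPy sketch with hypothetical frequencies:

```python
import numpy as np

def score_1_to_10(x):
    """Min-max rescaling to [1, 10], as in the overall score plots:
    9*((x - min) / (max - min)) + 1."""
    x = np.asarray(x, dtype=float)
    return 9.0 * (x - x.min()) / (x.max() - x.min()) + 1.0

freq = np.array([3.0, 7.0, 12.0, 5.0])   # hypothetical variable frequencies
s = score_1_to_10(freq)
```

The smallest input always maps to 1 and the largest to 10, which puts the brute-force and random-subset scores on a common axis for the combined plot at the end of part 10. (Note the map is undefined when all inputs are equal, since max - min is zero.)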
%% ========================================================================
%% ---------- part 10: test score - random subset selection ---------------
% =========================================================================
%%
if part_code(10) == 1
1361 m = 100; % no. of randomly selected sets1362 k = 3; % no. of variable combination out of all variables1363 select = [];1364
1365 for i = 1:m1366 select(i,:) = randsample(data_var_num(end,:),k,false);1367 end1368
1369 sum_limit_effect_var = 0;1370
1371 figure;1372
1373 for i = 1:m % trials1374
1375 data_select = []; % instructing survival data from ...set #i
1376 data_select = data_var(:,select(i,1:k)); % for set #i1377 data_select(:,k+1) = sum(data_select,2);1378 data_select(1:n,k+2) = data_time;1379
1380 var_select = []; % instructing survival data from ...set #i based on element k+1 (for time t,select if ...at least "sum_limit_effect_var" variables at 1)
1381 var_select = ...data_select(find(data_select(:,k+1)>sum_limit_effect_var),k+2); ...% remove ineffective failure times from set #i of ...all combinations
1382 var_select(:,2) = 1; % just as a frequency of ...failure time in order to plot kaplan-meier ...survival function
1383
1384 if isempty(var_select) == 0;1385
1386 var_select_surv = [];1387 [var_select_surv(:,1),var_select_surv(:,2)] = ...
160
ecdf(var_select(:,1),'censoring',var_select(:,2),'function','survivor'); ...% kaplan-meier plot
1388
1389 [select(i,k+1),select(i,k+2)] = ...ranksum(data_time,var_select(:,1)); % ...wilcoxon rank sum test (col k+1 and k+2) for time
1390 select(i,k+3) = ...logrank(data_time,var_select(:,1)); % logrank ...test (col k+3) for time
1391
1392 [select(i,k+4),select(i,k+5)] = ...ranksum(data_surv,var_select_surv(:,1)); % ...wilcoxon rank sum test (col k+4 and k+5) for ...survival
1393 % select(i,k+6) = ...logrank(data_surv,var_comb_surv(:,1)); % ...logrank test (col k+6) for survival
1394
1395 end1396
% kaplan-meier plot for all =======================================

plot_km_for_all = 0;

if plot_km_for_all == 1

    if select(i,k+2)==0; color_km_all = [0.8 0.8 0.8]; end;
    if select(i,k+2)==1; color_km_all = [0.0 0.0 0.0]; end;

    c_km_all_code = 1 - select(i,k+4);
    % if select(i,k+2)==0; color_km_all = [c_km_all_code c_km_all_code c_km_all_code]; end
    % if select(i,k+2)==1; color_km_all = [c_km_all_code c_km_all_code c_km_all_code]; end

    % figure;
    [f_km_all,x_km_all] = ecdf(var_select(:,1),'function','survivor'); % kaplan-meier plot

    stairs(x_km_all,f_km_all,'color',color_km_all);
    hold on; grid on;
    % title('kaplan-meier survival function','FontSize',15);
    % xlabel('time,t','FontSize',15); ylabel('survival probability,s(t)','FontSize',15);
    % % plot(x_km_all,1-f_km_all,'color','r'); hold on; grid on; % survival plot
    % % h = get(gca,'children');
    % % set(h,'color','k');

    [p_km_all,s_km_all] = polyfit(var_select_failure(:,2),var_select_surv(:,1),1); % poly fit
    % z_km_all = polyval(p_km_all,var_comb_failure(:,2));
    % plot(var_comb_failure(:,2),var_comb_surv(:,1),'o',var_comb_failure(:,2),z_km_all,'-'); hold on; grid on;
    % plot(var_comb_failure(:,2),z_km_all,'-','Color','r');
    hold on; grid on;
    axis([0 data_cal(n,2) 0 1]);

    y_km = data_cens(:,2)'; % or data_time'
    freqy_km = data_cens(:,3)';
    [f_km,x_km,l_bound_km,u_bound_km] = ecdf(y_km,'censoring',freqy_km);
    data_surv_km = 1-f_km;

    % figure;
    % ecdf(y_km,'censoring',freqy_km,'function','survivor'); % kaplan-meier plot
    stairs(x_km,data_surv_km,'color',[0.8 0.0 0.0],'LineWidth',1.5);
    stairs(x_km,1-l_bound_km,'r:'); stairs(x_km,1-u_bound_km,'r:');
    hold on; grid on;
    % title('kaplan-meier survival function','FontSize',15);
    xlabel('time, t','FontSize',15); ylabel('survival probability, s(t)','FontSize',15);
    % plot(x,s,'color','r'); hold on; grid on; % survival plot
    % h = get(gca,'children');
    % set(h,'color','k');

    [p_km,s_km] = polyfit(x_km,data_surv_km,1); % poly fit
    z_km = polyval(p_km,x_km);
    % plot(x,s,'o',x,z,'-'); hold on; grid on;
    % plot(x_km,z_km,'m:');
    hold on; grid on;
    axis([0 data_cal(n,2) 0 1]);

end

% =================================================================

end
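The `ecdf(...,'function','survivor')` calls above produce the Kaplan-Meier survival estimate. As a cross-check of what that estimator computes, here is a minimal Python sketch; the function name and toy data are illustrative and not part of the MATLAB pipeline, and the sketch ignores confidence bounds.

```python
import numpy as np

def kaplan_meier(times, censored):
    """Kaplan-Meier survival estimate.
    times: event/censoring times; censored: 1 if censored, 0 if event."""
    times = np.asarray(times, dtype=float)
    censored = np.asarray(censored, dtype=int)
    order = np.argsort(times)
    times, censored = times[order], censored[order]
    uniq = np.unique(times[censored == 0])  # distinct event times
    s = 1.0
    surv = []
    for t in uniq:
        at_risk = np.sum(times >= t)                     # still under observation at t
        events = np.sum((times == t) & (censored == 0))  # events exactly at t
        s *= 1.0 - events / at_risk                      # product-limit update
        surv.append((float(t), float(s)))
    return surv

# toy data: events at 1, 2, 4; one censored observation at 3
print(kaplan_meier([1, 2, 3, 4], [0, 0, 1, 0]))
```

The censored observation at t = 3 leaves the risk set without triggering a survival drop, which is exactly what the `'censoring'` argument to `ecdf` encodes.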
data_nptest = [];
data_nptest_cum = [];
data_nptest_cum_unique = [];
data_nptest_elem_freq = [];
data_nptest_elem = [];

data_nptest = select;

% sortcomb = sortrows(comb,4)

% ---------- keep rejected sets (positive approach) -------------------

% k+2 or k+5
data_nptest(find([data_nptest(:,k+2)]==0),:) = []; % keep rejected sets: eliminate sets with wilcoxon rank = 0 and keep wilcoxon rank = 1 (keep wilcoxon rank < 0.05)
% data_nptest(find([data_nptest(:,k+1)]>0.05),:) = []; % keep rejected sets: eliminate sets with wilcoxon rank > 0.05 and keep wilcoxon rank < 0.05

% data_nptest(find([data_nptest(:,k+3)]>0.05),:) = []; % keep rejected sets: eliminate sets with logrank > 0.05 and keep logrank < 0.05

% ---------- eliminate rejected sets (negative approach) --------------

% data_nptest(find([data_nptest(:,k+2)]==1),:) = []; % eliminate rejected sets: eliminate sets with wilcoxon rank = 1 and keep wilcoxon rank = 0 (keep wilcoxon rank > 0.05)
% data_nptest(find([data_nptest(:,k+1)]<0.05),:) = []; % eliminate rejected sets: eliminate sets with wilcoxon rank < 0.05 and keep wilcoxon rank > 0.05

% data_nptest(find([data_nptest(:,k+3)]<0.05),:) = []; % eliminate rejected sets: eliminate sets with logrank < 0.05 and keep logrank > 0.05

% ---------------------------------------------------------------------
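The "positive approach" above keeps a candidate variable set only when the Wilcoxon rank-sum test rejects at the 0.05 level. A Python sketch of that filter, using the large-sample normal approximation to the rank-sum statistic and ignoring ties; all names here are illustrative, and MATLAB's `ranksum` uses an exact test for small samples, which this sketch does not reproduce.

```python
import numpy as np
from math import erf, sqrt

def ranksum_p(x, y):
    """Two-sided Wilcoxon rank-sum p-value via the normal approximation
    (ties are not corrected for in this sketch)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n1, n2 = len(x), len(y)
    allv = np.concatenate([x, y])
    ranks = allv.argsort().argsort() + 1.0   # ranks 1..n1+n2
    w = ranks[:n1].sum()                     # rank sum of the first sample
    mu = n1 * (n1 + n2 + 1) / 2.0            # null mean of w
    sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)  # null std of w
    z = (w - mu) / sigma
    return 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z) / sqrt(2.0))))

def keep_rejected(subsets, pvals, alpha=0.05):
    """'Positive approach': keep only subsets whose test rejects H0."""
    return [s for s, p in zip(subsets, pvals) if p < alpha]
```

Interleaved samples such as `[1, 3, 5]` vs `[2, 4, 6]` give a large p-value and the subset is filtered out; well-separated samples give a small p-value and the subset is kept.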
data_nptest = data_nptest(:,1:k);
[data_row data_col] = size(data_nptest);

for i = 1:data_col % put all selected var numbers in a single vector
    data_nptest_cum((i-1)*data_row+1:i*data_row,1) = data_nptest(:,i);
end

data_nptest_cum_unique = unique(data_nptest_cum); % elements in data_nptest_cum
[data_nptest_elem_freq,data_nptest_elem] = hist(data_nptest_cum,data_nptest_cum_unique); % elements in data_nptest_cum and their frequencies

figure;
bar(data_nptest_cum_unique,histc(data_nptest_cum,data_nptest_cum_unique)); % elements in data_nptest_cum and their frequencies - method 1
hold on; grid off;
title('wilcoxon rank or logrank frequencies (method 1)','FontSize',15); hold on;
xlabel('variable','FontSize',15); ylabel('censoring','FontSize',15); hold on;

figure;
bar(data_nptest_elem,data_nptest_elem_freq); % elements in data_nptest_cum and their frequencies - method 2
hold on; grid off;
title('wilcoxon rank or logrank frequencies (method 2)','FontSize',15); hold on;
xlabel('variable','FontSize',15); ylabel('censoring','FontSize',15); hold on;

data_nptest_elem_select = data_nptest_elem; % for overall plot at the end of part 10
data_nptest_elem_freq_select = 9*((data_nptest_elem_freq - min(data_nptest_elem_freq))/(max(data_nptest_elem_freq) - min(data_nptest_elem_freq))) + 1; % 1 to 10 score normalized for overall plot at the end of part 10
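The last line rescales the variable frequencies onto a 1-to-10 score by min-max normalization. In isolation the transformation is:

```python
def rescale_1_to_10(freq):
    """Min-max rescale onto [1, 10]: the minimum maps to 1, the maximum
    to 10, matching the 9*((f - min)/(max - min)) + 1 expression above."""
    lo, hi = min(freq), max(freq)
    return [9 * (f - lo) / (hi - lo) + 1 for f in freq]

print(rescale_1_to_10([2, 5, 8]))  # → [1.0, 5.5, 10.0]
```

Note the formula is undefined when all frequencies are equal (max = min); the MATLAB line shares that edge case.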
% ---------- cumulative test scores for all variables -----------------

data_select_nptest_cum = [];

for i = 1:p-3
    data_select_nptest_cum(i,1) = i;
    data_select_nptest_cum(i,2) = 0;
    data_select_nptest_cum(i,3) = 0;
end;

for i = 1:m
    for j = 1:k
        data_select_nptest_cum(select(i,j),2) = data_select_nptest_cum(select(i,j),2) + select(i,k+1); % cumulation of all wilcoxon rank for all variables
        data_select_nptest_cum(select(i,j),3) = data_select_nptest_cum(select(i,j),3) + select(i,k+3); % cumulation of all logrank for all variables
    end
end

figure;
bar(data_select_nptest_cum(:,2),'r','grouped');
hold on; grid off;
title('cumulation of all wilcoxon rank','FontSize',15); hold on;
xlabel('variable','FontSize',15); ylabel('y','FontSize',15); hold on;

figure;
bar(data_select_nptest_cum(:,3),'b','stacked');
hold on; grid off;
title('cumulation of all logrank','FontSize',15); hold on;
xlabel('variable','FontSize',15); ylabel('y','FontSize',15); hold on;

figure;
bar(data_select_nptest_cum(:,2:3),'grouped');
hold on; grid off;
title('cumulation of all wilcoxon rank and logrank','FontSize',15); hold on;
xlabel('variable','FontSize',15); ylabel('y','FontSize',15); hold on;

figure;
bar(data_select_nptest_cum(:,2:3),'stacked');
hold on; grid off;
title('cumulation of all wilcoxon rank and logrank','FontSize',15); hold on;
xlabel('variable','FontSize',15); ylabel('y','FontSize',15); hold on;

data_select_nptest_cum_norm = 9*((data_select_nptest_cum(:,2) - min(data_select_nptest_cum(:,2)))/(max(data_select_nptest_cum(:,2)) - min(data_select_nptest_cum(:,2)))) + 1; % 1 to 10 score normalized for overall plot at the end of part 10

figure;
subplot(2,1,1);
bar(data_nptest_elem_comb,data_nptest_elem_freq_comb,'FaceColor',[0.1 0.1 0.1]); % wilcoxon
hold on; grid off;
bar(data_nptest_elem_select,data_nptest_elem_freq_select,0.4,'FaceColor',[1.0 0.6 0.8]); % wilcoxon
hold on; grid off;
% title('...','FontSize',15); hold on;
xlabel('variable','FontSize',15); ylabel({'normalized';'efficiency score'},'FontSize',15); hold on;
axis([0 18 0 10]);
subplot(2,1,2);
bar(data_comb_nptest_cum_norm,'FaceColor',[0.8 0.8 0.8]); % wilcoxon
hold on; grid off;
bar(data_select_nptest_cum_norm,0.4,'FaceColor',[0.0 0.3 0.2]); % wilcoxon
hold on; grid off;
% title('...','FontSize',15); hold on;
xlabel('variable','FontSize',15); ylabel({'normalized';'inefficiency score'},'FontSize',15); hold on;
axis([0 18 0 10]);
end % if part_code(10) == 1

%% ========================================================================
%% ---------- part 11: time score - brute-force (exhaustive) search -------
% =========================================================================
%%
if part_code(11) == 1
k = 3; % no. of variables in each combination out of all variables
comb = [];
comb = nchoosek(data_var_num(end,:),k);
[comb_row comb_col] = size(comb);

sum_limit_effect_var = k-1;

for i = 1:comb_row

    data_comb = []; % constructing survival data from set #i
    data_comb = data_var(:,comb(i,1:k)); % for set #i
    data_comb(:,k+1) = sum(data_comb,2);
    data_comb(1:n,k+2) = data_time;

    var_comb = []; % constructing survival data from set #i based on element k+1 (for time t, select if exactly "sum_limit_effect_var" variables are 1)
    var_comb = data_comb(find(data_comb(:,k+1)==sum_limit_effect_var),k+2); % remove ineffective failure times from set #i of all combinations
    var_comb(:,2) = 1; % frequency of each failure time, in order to plot the kaplan-meier survival function

    var_comb_sum = [];
    var_comb_sum = sum(var_comb(:,1)); % score of set = sum of failure times including all variables in set #i

    comb(i,k+1) = var_comb_sum; % add score to last col of set #i

end

data_comb_time_cum = []; % cumulative scores for all variables

for i = 1:p-3
    data_comb_time_cum(i,1) = i;
    data_comb_time_cum(i,2) = 0;
end;

for i = 1:comb_row
    for j = 1:comb_col % = k
        data_comb_time_cum(comb(i,j),2) = data_comb_time_cum(comb(i,j),2) + comb(i,k+1); % cumulation of all scores for all variables
    end
end

figure;
bar(data_comb_time_cum(:,2),'k','grouped');
hold on; grid off;
title('cumulation of all variable scores - combination','FontSize',15); hold on;
xlabel('variable','FontSize',15); ylabel('y','FontSize',15); hold on;

end % if part_code(11) == 1
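The exhaustive scoring loop of part 11 can be sketched in Python with `itertools.combinations`; the toy matrix, threshold, and function name below are illustrative. As in the MATLAB code, a row contributes its failure time only when exactly `limit` of the subset's binary covariates equal 1, and each subset's score is credited to every variable in the subset.

```python
from itertools import combinations
import numpy as np

def brute_force_scores(X, t, k, limit):
    """Score every k-subset of binary covariates: the score is the sum of
    failure times over rows where exactly `limit` of the subset's columns
    are 1; subset scores are then accumulated per variable."""
    n, p = X.shape
    cum = np.zeros(p)
    for subset in combinations(range(p), k):
        mask = X[:, subset].sum(axis=1) == limit  # rows matching the threshold
        score = t[mask].sum()                     # sum of those failure times
        for v in subset:
            cum[v] += score                       # credit every member variable
    return cum

# illustrative toy data: 4 observations, 3 binary covariates
X = np.array([[1, 1, 0], [1, 0, 1], [0, 1, 1], [1, 1, 1]])
t = np.array([10.0, 20.0, 30.0, 40.0])
print(brute_force_scores(X, t, 2, 1))  # → [90. 80. 70.]
```

The cost grows as C(p, k) subsets, which is what motivates the randomized variant in part 12.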
%% ========================================================================
%% ---------- part 12: time score - random subset selection ---------------
% =========================================================================
%%
if part_code(12) == 1

m = 100; % no. of randomly selected sets
k = 3; % no. of variables in each combination out of all variables
select = [];

for i = 1:m
    select(i,:) = randsample(data_var_num(end,:),k,false);
end

sum_limit_effect_var = k;

for i = 1:m % trials

    data_select = []; % constructing survival data from set #i
    data_select = data_var(:,select(i,1:k)); % for set #i
    data_select(:,k+1) = sum(data_select,2);
    data_select(1:n,k+2) = data_time;

    var_select = []; % constructing survival data from set #i based on element k+1 (for time t, select if exactly "sum_limit_effect_var" variables are 1)
    var_select = data_select(find(data_select(:,k+1)==sum_limit_effect_var),k+2); % remove ineffective failure times from set #i of all combinations
    var_select(:,2) = 1; % frequency of each failure time, in order to plot the kaplan-meier survival function

    var_select_sum = [];
    var_select_sum = sum(var_select(:,1)); % score of set = sum of failure times including all variables in set #i

    select(i,k+1) = var_select_sum; % add score to last col of set #i

end

data_select_time_cum = []; % cumulative scores for all variables

for i = 1:p-3
    data_select_time_cum(i,1) = i;
    data_select_time_cum(i,2) = 0;
end;

for i = 1:m
    for j = 1:k
        data_select_time_cum(select(i,j),2) = data_select_time_cum(select(i,j),2) + select(i,k+1); % cumulation of all scores for all variables
    end
end

figure;
bar(data_select_time_cum(:,2),'r','grouped');
hold on; grid off;
title('cumulation of all variable scores - randomization','FontSize',15); hold on;
xlabel('variable','FontSize',15); ylabel('y','FontSize',15); hold on;

end % if part_code(12) == 1
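Part 12 repeats the same scoring over m randomly drawn subsets instead of all `nchoosek` combinations (`randsample(...,k,false)` draws k variables without replacement). A Python sketch of the randomized variant, seeded for reproducibility; names and toy data are illustrative.

```python
import random
import numpy as np

def random_subset_scores(X, t, k, m, limit, seed=0):
    """Like the exhaustive search, but over m subsets drawn with
    random.sample (k variables without replacement per draw)."""
    rng = random.Random(seed)
    n, p = X.shape
    cum = np.zeros(p)
    for _ in range(m):
        subset = rng.sample(range(p), k)          # random set, no replacement
        mask = X[:, subset].sum(axis=1) == limit  # rows matching the threshold
        score = t[mask].sum()
        for v in subset:
            cum[v] += score
    return cum

# illustrative: with all-ones covariates and limit = k, every row always counts
X = np.ones((3, 4))
t = np.array([1.0, 2.0, 3.0])
cum = random_subset_scores(X, t, k=2, m=5, limit=2)
```

With m fixed, the cost is O(m) subset evaluations regardless of p, at the price of sampling noise in the per-variable totals.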
%% ========================================================================
%% ---------- part 13: bisection greedy clustering algorithm --------------
% =========================================================================
%%
if part_code(13) == 1

m = 100; % no. of trials
clus_limit = 3; % min no. of cluster members
var_num = p-3;
var_vector = [];
var_vector = data_var_num(end,:);
var_rand = [];

data_nptest = [];
data_nptest_cum = [];
data_nptest_cum_unique = [];
data_nptest_elem_freq = [];
data_nptest_elem = [];

for i = 1:m % trials

    greedy_cluster_num = var_num/2;
    var_rand = [];
    greedy_a_var_num = [];
    greedy_b_var_num = [];

    var_rand = var_vector(randperm(length(var_vector))); % random sort of var numbers

    while greedy_cluster_num >= clus_limit

        k = greedy_cluster_num;
        sum_limit_effect_var = greedy_cluster_num-2;

        greedy_a_var_num = var_rand(1:greedy_cluster_num);
        greedy_b_var_num = var_rand(greedy_cluster_num+1:end);

        % cluster a
        greedy_a = [];
        data_greedy_a = []; % cluster a
        data_greedy_a = data_var(:,greedy_a_var_num(1,1:k)); % constructing cluster a
        data_greedy_a(:,k+1) = sum(data_greedy_a,2);
        data_greedy_a(1:n,k+2) = data_time;

        var_greedy_a = []; % constructing survival data from cluster a based on element k+1 (for time t, select if more than "sum_limit_effect_var" variables are 1)
        var_greedy_a = data_greedy_a(find(data_greedy_a(:,k+1)>sum_limit_effect_var),k+2); % remove ineffective failure times from cluster a
        var_greedy_a(:,2) = 1; % frequency of each failure time, in order to plot the kaplan-meier survival function

        [greedy_a(1,1),greedy_a(1,2)] = ranksum(data_time,var_greedy_a(:,1)); % wilcoxon rank sum test (cols 1 and 2)
        greedy_a(1,3) = logrank(data_time,var_greedy_a(:,1)); % logrank test (col 3)

        % cluster b
        greedy_b = [];
        data_greedy_b = []; % cluster b
        data_greedy_b = data_var(:,greedy_b_var_num(1,1:k)); % constructing cluster b
        data_greedy_b(:,k+1) = sum(data_greedy_b,2);
        data_greedy_b(1:n,k+2) = data_time;

        var_greedy_b = []; % constructing survival data from cluster b based on element k+1 (for time t, select if more than "sum_limit_effect_var" variables are 1)
        var_greedy_b = data_greedy_b(find(data_greedy_b(:,k+1)>sum_limit_effect_var),k+2); % remove ineffective failure times from cluster b
        var_greedy_b(:,2) = 1; % frequency of each failure time, in order to plot the kaplan-meier survival function

        [greedy_b(1,1),greedy_b(1,2)] = ranksum(data_time,var_greedy_b(:,1)); % wilcoxon rank sum test (cols 1 and 2)
        greedy_b(1,3) = logrank(data_time,var_greedy_b(:,1)); % logrank test (col 3)

        var_rand = [];

        if greedy_a(1,1) > greedy_b(1,1)
            var_rand = greedy_a_var_num;
            var_rand_test = greedy_a_var_num;
            var_rand_test(1,4) = greedy_a(1,1);
        else
            var_rand = greedy_b_var_num;
            var_rand_test = greedy_b_var_num;
            var_rand_test(1,4) = greedy_b(1,1);
        end

        greedy_cluster_num = greedy_cluster_num/2;

    end % while

    data_nptest = [data_nptest; var_rand];

end % for i = 1:m % trials

[data_row data_col] = size(data_nptest);

for i = 1:data_col % put all selected var numbers in a single vector
    data_nptest_cum((i-1)*data_row+1:i*data_row,1) = data_nptest(:,i);
end

data_nptest_cum_unique = unique(data_nptest_cum); % elements in data_nptest_cum
[data_nptest_elem_freq,data_nptest_elem] = hist(data_nptest_cum,data_nptest_cum_unique); % elements in data_nptest_cum and their frequencies

figure;
bar(data_nptest_cum_unique,histc(data_nptest_cum,data_nptest_cum_unique),'m'); % elements in data_nptest_cum and their frequencies - method 1
hold on; grid off;
title('wilcoxon rank or logrank frequencies (method 1)','FontSize',15); hold on;
xlabel('variable','FontSize',15); ylabel('censoring','FontSize',15); hold on;

figure;
bar(data_nptest_elem,data_nptest_elem_freq,'c'); % elements in data_nptest_cum and their frequencies - method 2
hold on; grid off;
title('wilcoxon rank or logrank frequencies (method 2)','FontSize',15); hold on;
xlabel('variable','FontSize',15); ylabel('censoring','FontSize',15); hold on;

end % if part_code(13) == 1
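The halving logic of part 13 can be sketched independently of the test statistic: shuffle the variables, split them in two, keep the half whose score is larger (the MATLAB code keeps the half with the larger rank-sum p-value), and repeat until the half-size drops below `clus_limit`. All names below are illustrative, and `score_fn` stands in for the rank-sum comparison.

```python
import random

def bisection_greedy(variables, score_fn, clus_limit=3, seed=0):
    """Repeatedly halve a randomly permuted variable list, keeping the
    half with the larger score, until the half-size is below clus_limit."""
    rng = random.Random(seed)
    pool = list(variables)
    rng.shuffle(pool)
    half = len(pool) // 2
    while half >= clus_limit:
        a, b = pool[:half], pool[half:]        # split the surviving pool
        pool = a if score_fn(a) > score_fn(b) else b  # keep the better half
        half = half // 2
    return pool
```

Each trial costs only O(log p) score evaluations, but the result depends on the initial permutation, which is why the MATLAB code averages over m = 100 trials.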
%% ========================================================================
%% ---------- part 14: splitting semi-greedy clustering algorithm ---------
% =========================================================================
%%
if part_code(14) == 1

m = 100; % no. of trials
k = 3; % no. of cluster members
var_num = p-3;
% clus_nu = ((p-3) - mod(p-3,k))/k;
clus_nu = fix((p-3)/k);

var_vector = [];
var_vector = data_var_num(end,:);
var_rand = [];

data_nptest = [];
data_nptest_cum = [];
data_nptest_cum_unique = [];
data_nptest_elem_freq = [];
data_nptest_elem = [];

for i = 1:m % trials

    var_rand = [];
    greedy(clus_nu) = struct('var_num',[],'data',[],'var',[],'test',[]);

    % var_rand = var_vector(randperm(length(var_vector))); % random sort of var numbers
    var_rand = randperm(p-3); % random sort of var numbers

    sum_limit_effect_var = 0;

    for j = 1:clus_nu
        greedy(j).var_num = var_rand(k*(j-1)+1:k*j);
    end
    % greedy(clus_nu).var_num = var_rand(k*(clus_nu-1)+1:end)

    for j = 1:clus_nu % cluster #j
        [greedy_var_num_row greedy_var_num_col] = size(greedy(j).var_num);
        kk = greedy_var_num_col; % equal to k for regular clusters; less than k if the remainder is put in an extra cluster
        greedy(j).data = []; % cluster #j
        greedy(j).data = data_var(:,greedy(j).var_num(1,1:kk)); % constructing cluster #j (var_num(1,1:k))
        greedy(j).data(:,kk+1) = sum(greedy(j).data,2);
        greedy(j).data(1:n,kk+2) = data_time;

        greedy(j).var = []; % constructing survival data from cluster #j based on element k+1 (for time t, select if more than "sum_limit_effect_var" variables are 1)
        greedy(j).var = greedy(j).data(find(greedy(j).data(:,kk+1)>sum_limit_effect_var),kk+2); % remove ineffective failure times from cluster #j
        greedy(j).var(:,2) = 1; % frequency of each failure time, in order to plot the kaplan-meier survival function

        [greedy(j).test(1,1),greedy(j).test(1,2)] = ranksum(data_time,greedy(j).var(:,1)); % wilcoxon rank sum test (cols 1 and 2)
        greedy(j).test(1,3) = logrank(data_time,greedy(j).var(:,1)); % logrank test (col 3)

        greedy_result(1,j) = j;
        greedy_result(2,j) = greedy(j).test(1,1);

    end

    [max_num,max_index] = max(greedy_result(2,:));

    var_rand = greedy(greedy_result(1,max_index)).var_num;
    data_nptest = [data_nptest; var_rand];

end % for i = 1:m % trials

[data_row data_col] = size(data_nptest);

for i = 1:data_col % put all selected var numbers in a single vector
    data_nptest_cum((i-1)*data_row+1:i*data_row,1) = data_nptest(:,i);
end

data_nptest_cum_unique = unique(data_nptest_cum); % elements in data_nptest_cum
[data_nptest_elem_freq,data_nptest_elem] = hist(data_nptest_cum,data_nptest_cum_unique); % elements in data_nptest_cum and their frequencies

figure;
bar(data_nptest_cum_unique,histc(data_nptest_cum,data_nptest_cum_unique),'FaceColor',[0.0,0.4,0.4],'EdgeColor',[0.0,0.1,0.1]); % elements in data_nptest_cum and their frequencies - method 1
hold on; grid off;
title('splitting semi-greedy clustering - wilcoxon rank or logrank frequencies (method 1)','FontSize',15); hold on;
xlabel('variable','FontSize',15); ylabel('censoring','FontSize',15); hold on;

figure;
bar(data_nptest_elem,data_nptest_elem_freq,'FaceColor',[0.4,0.4,0.4],'EdgeColor',[0.1,0.1,0.1]); % elements in data_nptest_cum and their frequencies - method 2
hold on; grid off;
title('splitting semi-greedy clustering - wilcoxon rank or logrank frequencies (method 2)','FontSize',15); hold on;
xlabel('variable','FontSize',15); ylabel('censoring','FontSize',15); hold on;

% survival comparison of sets (instead of time)

end % if part_code(14) == 1
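The splitting step of part 14, stripped of the survival machinery, is: partition a random permutation of the p variables into floor(p/k) clusters of size k, score each cluster, and keep the best one per trial. A Python sketch with illustrative names; `score_fn` stands in for the rank-sum p-value used in the MATLAB code.

```python
import random

def splitting_semi_greedy(p, k, score_fn, seed=0):
    """Partition a random permutation of p variables into floor(p/k)
    clusters of size k and return the highest-scoring cluster."""
    rng = random.Random(seed)
    perm = list(range(p))
    rng.shuffle(perm)
    n_clus = p // k                                   # fix((p-3)/k) in the MATLAB code
    clusters = [perm[k*j:k*(j+1)] for j in range(n_clus)]
    return max(clusters, key=score_fn)
```

Unlike the bisection variant, every trial scores all clusters once, so a trial costs floor(p/k) evaluations; leftover variables (when k does not divide p) are simply dropped, as in the `fix((p-3)/k)` line above.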
%% ========================================================================
%% ---------- part 15: weighted standardized score method -----------------
% =========================================================================
%%
if part_code(15) == 1

for i = 1:p-3
    weighted_score(:,i) = data_time.*data_nor(:,i);
end

weighted_score_sum(1,:) = abs(sum(weighted_score(:,:)));

figure;
bar(x_axis,weighted_score_sum,'FaceColor',[0.0,0.0,0.4],'EdgeColor',[0.1,0.1,0.1]);
hold on; grid off;
title('one by one comparison weighted score','FontSize',15);
xlabel('variable','FontSize',15); ylabel('importance','FontSize',15); hold on;
% axis([0 17 -1 2])

m = 100; % no. of randomly selected sets
k = 3; % no. of variables in each combination out of all variables
weight = [];

for i = 1:m
    weight(i,:) = randsample(data_var_num(end,:),k,false);
end

% weight = nchoosek(data_var_num(end,:),k);
[weight_row weight_col] = size(weight);

for i = 1:weight_row

    data_weight = []; % constructing survival data from set #i
    data_weight = data_var(:,weight(i,1:k)); % for set #i
    data_weight(:,k+1) = sum(data_weight,2);
    data_weight(1:n,k+2) = data_time;
    data_weight(:,k+3) = data_weight(:,k+1).*data_weight(:,k+2); % weighted score of each observation of set #i

    weight(i,k+1) = sum(data_weight(:,k+3)); % weighted score of set #i

end

data_weight_cum = [];

for i = 1:p-3
    data_weight_cum(i,1) = i;
    data_weight_cum(i,2) = 0;
end;

for i = 1:weight_row
    for j = 1:weight_col % = k
        data_weight_cum(weight(i,j),2) = data_weight_cum(weight(i,j),2) + weight(i,k+1); % cumulation of all weighted scores for all variables
    end
end

figure;
bar(x_axis,data_weight_cum(:,2),'FaceColor',[0.2,0.2,0.2],'EdgeColor',[0.1,0.1,0.1]);
hold on; grid off;
title('combination weighted score','FontSize',15); hold on;
xlabel('variable','FontSize',15); ylabel('combination weighted score','FontSize',15); hold on;

end % if part_code(15) == 1
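The one-by-one weighted score at the top of part 15 is, per variable, the absolute sum of failure times weighted by that variable's standardized column. A vectorized Python sketch of that first loop; the names and toy values are illustrative.

```python
import numpy as np

def weighted_scores(t, Z):
    """One-by-one weighted score: |sum_j t_j * z_ji| for each standardized
    covariate column z_i, matching the data_time .* data_nor(:,i) loop."""
    return np.abs((t[:, None] * Z).sum(axis=0))

# toy example: 3 observations, 2 standardized covariates
t = np.array([1.0, 2.0, 3.0])
Z = np.array([[1.0, -1.0],
              [0.0,  1.0],
              [-1.0, 0.0]])
print(weighted_scores(t, Z))  # → [2. 1.]
```

The absolute value means a variable whose standardized values correlate strongly with time in either direction scores high, while a variable whose positive and negative contributions cancel scores near zero.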
%% ========================================================================
%% ---------- part 16: clustering aft regression --------------------------
% =========================================================================
%%
if part_code(16) == 1

k = 3; % no. of variables in each combination out of all variables
comb_reg = [];
comb_reg = nchoosek(data_var_num(end,:),k); % data_var_num(end,:) is the vector of variable numbers
[comb_reg_row comb_reg_col] = size(comb_reg);

for i = 1:comb_reg_row

    % raw data ...........................................................
    data_reg_comb_raw = []; % constructing survival data from set #i
    data_reg_comb_raw = data_raw(:,comb_reg(i,1:k)); % for set #i
    data_reg_comb_raw(:,k+1) = sum(data_reg_comb_raw,2);
    data_reg_comb_raw(1:n,k+2) = data_time; % time
    data_reg_comb_raw(1:n,k+3) = data_log(1:n,3); % status

    reg_raw = regress(data_time,data_reg_comb_raw(:,1:k));
    % .................................................................

    data_reg_comb_var = []; % constructing survival data from set #i
    data_reg_comb_var = data_var(:,comb_reg(i,1:k)); % for set #i
    data_reg_comb_var(:,k+1) = sum(data_reg_comb_var,2);
    data_reg_comb_var(1:n,k+2) = data_time; % time
    data_reg_comb_var(1:n,k+3) = data_log(1:n,3); % status

    reg_var_a = regress(data_time,data_reg_comb_var(:,1:k));
    reg_var_b = fitlm(data_reg_comb_var(:,1:k),data_time);
    reg_var_c = fitlm(data_reg_comb_var(:,1:k),data_time,'linear','RobustOpts','on');

    comb_reg(i,k+1:2*k) = abs(reg_var_a');
    % comb_reg(i,2*k+1:3*k) = abs(reg_var_b.Coefficients(2:end,1));
    % comb_reg(i,3*k+1:4*k) = abs(reg_var_c.Coefficients(2:end,1));
end

data_reg_cum = [];

for i = 1:p-3
    data_reg_cum(i,1) = i;
    data_reg_cum(i,2) = 0;
    % data_reg_cum(i,3) = 0;
    % data_reg_cum(i,4) = 0;
end;

for i = 1:comb_reg_row
    for j = 1:comb_reg_col % = k
        data_reg_cum(comb_reg(i,j),2) = data_reg_cum(comb_reg(i,j),2) + comb_reg(i,1*k+j); % cumulation of all regression type 1
        % data_reg_cum(comb_reg(i,j),3) = data_reg_cum(comb_reg(i,j),3) + comb_reg(i,2*k+j); % cumulation of all regression type 2
        % data_reg_cum(comb_reg(i,j),4) = data_reg_cum(comb_reg(i,j),4) + comb_reg(i,3*k+j); % cumulation of all regression type 3
    end
end

figure;
bar(data_reg_cum(:,2),'k','grouped');
hold on; grid off;
title('cumulative clustering aft regression','FontSize',15); hold on;
xlabel('variable','FontSize',15); ylabel('y','FontSize',15); hold on;

figure;
bar(data_reg_cum(:,3),'b','grouped');
hold on; grid off;
title('cumulative clustering aft regression','FontSize',15); hold on;
xlabel('variable','FontSize',15); ylabel('y','FontSize',15); hold on;

figure;
bar(data_reg_cum(:,4),'r','grouped');
hold on; grid off;
title('cumulative clustering aft regression','FontSize',15); hold on;
xlabel('variable','FontSize',15); ylabel('y','FontSize',15); hold on;

figure;
bar(data_reg_cum(:,2:4),'grouped');
hold on; grid off;
title('cumulative clustering aft regression','FontSize',15); hold on;
xlabel('variable','FontSize',15); ylabel('y','FontSize',15); hold on;

figure;
bar(data_reg_cum(:,2:4),'stacked');
hold on; grid off;
title('cumulative clustering aft regression','FontSize',15); hold on;
xlabel('variable','FontSize',15); ylabel('y','FontSize',15); hold on;

data_comb_nptest_cum_norm = 9*((data_comb_nptest_cum(:,2) - min(data_comb_nptest_cum(:,2)))/(max(data_comb_nptest_cum(:,2)) - min(data_comb_nptest_cum(:,2)))) + 1; % 1 to 10 score normalized for overall plot at the end of part 10

[b,bint,r,rint,stats] = regress(data_time,data_var);

% lm_unc = fitlm(data_unc(1:n,4:p),data_unc(1:n,2));
% rlm_unc = fitlm(data_unc(1:n,4:p),data_unc(1:n,2),'linear','RobustOpts','on');

lm_unc = fitlm(data_cens(1:n,4:p),data_cens(1:n,2));
rlm_unc = fitlm(data_cens(1:n,4:p),data_cens(1:n,2),'linear','RobustOpts','on');
lm_var = fitlm(data_var,data_log(1:n,2));

figure;
bar(x_axis,abs(b),'k'); hold on; grid on; % plot(b); hold on; grid on;
title('multiple linear regression model','FontSize',15); hold on;
xlabel('variable','FontSize',15); ylabel('y','FontSize',15); hold on;

% figure;
% plot(lm_unc); hold on; grid on;
% figure;
% plot(lm_var); hold on; grid on;

figure;
plot((b),'k');
hold on; grid on;
title('cumulation of all variable scores - randomization','FontSize',15); hold on;
xlabel('variable','FontSize',15); ylabel('y','FontSize',15); hold on;

plot((lm_unc.Coefficients(2:end,1)),'b');
hold on; grid on;
title('cumulation of all variable scores - randomization','FontSize',15); hold on;
xlabel('variable','FontSize',15); ylabel('y','FontSize',15); hold on;

plot((rlm_unc.Coefficients(2:end,1)),'r');

end % if part_code(16) == 1
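The accumulation at the heart of part 16 regresses the time vector on each k-subset's covariates (no intercept, as in `regress(data_time, data(:,1:k))`) and sums the absolute coefficients per variable. A Python sketch with `np.linalg.lstsq`; the identity-matrix demo is illustrative.

```python
from itertools import combinations
import numpy as np

def subset_regression_scores(X, t, k):
    """For each k-subset, least-squares regress t on the subset's columns
    (no intercept) and accumulate |coefficient| per member variable."""
    n, p = X.shape
    cum = np.zeros(p)
    for subset in combinations(range(p), k):
        beta, *_ = np.linalg.lstsq(X[:, subset], t, rcond=None)
        for v, b in zip(subset, np.abs(beta)):
            cum[v] += b
    return cum

# toy demo: orthogonal columns, so coefficients are read off directly
print(subset_regression_scores(np.eye(3), np.array([1.0, 2.0, 3.0]), 2))  # → [2. 4. 6.]
```

Because coefficients are taken in absolute value, a variable is rewarded for a strong association with time in either direction, mirroring the `abs(reg_var_a')` line above.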
%% ========================================================================
%% ---------- part 19: aft ------------------------------------------------
% =========================================================================
%%
if part_code(19) == 1

% http://www.mathworks.com/matlabcentral/fileexchange/38118-accelerated-failure-time--aft--models/content/aft%20models/aft.m

data_status = data_cens(:,3);
distr = 'weibull';
init = [];
% [pars covpars SE CI Zscores pvalues gval exitflag hess] = aft(data_time,data_raw,data_status,distr,init,minimizer,options,printresults);
[pars covpars SE CI Zscores pvalues gval exitflag hess] = aft(data_time,data_raw,data_status,distr,init,1,1,1);

end % if part_code(19) == 1
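The `aft` routine above fits a Weibull accelerated failure time model, i.e. a model of the form log(T) = b0 + X b + sigma * eps. As a rough sketch of that log-linear structure only, here is an ordinary least-squares fit of log(T) on the covariates in Python; unlike the maximum-likelihood `aft` fit, this sketch ignores censoring and the Weibull error distribution, and all names and data are illustrative.

```python
import numpy as np

def aft_loglinear(X, t):
    """Least-squares fit of log(T) = b0 + X b (AFT log-linear form).
    Sketch only: no censoring handling, no Weibull likelihood."""
    A = np.column_stack([np.ones(len(t)), X])  # prepend intercept column
    beta, *_ = np.linalg.lstsq(A, np.log(t), rcond=None)
    return beta

# toy data constructed so that log t = 1 + x exactly
X = np.array([[0.0], [1.0], [0.0], [1.0]])
t = np.exp(np.array([1.0, 2.0, 1.0, 2.0]))
b = aft_loglinear(X, t)
print(b)  # → approximately [1., 1.]
```

In an AFT model a positive coefficient stretches event times (exp(b) is the acceleration factor), which is the interpretation carried by the `pars` output of the `aft` call above.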
2084 %% ...========================================================================
2085 %% ---------- part 20: plots and charts ...-----------------------------------
2086 % ...=========================================================================
2087 %%2088 if part_code(20) == 12089
2090 % norm: (max - min)/range = (upper - lower)/new range2091
2092 x_data = data_select_nptest_cum(:,2); % part 102093 x_data_norm = 1*((x_data - min(x_data))/(max(x_data) - ...
min(x_data))) + 0; % max = 1,min = 02094 x_data_scatter = 20*((x_data - min(x_data))/(max(x_data) ...
- min(x_data))) + 0; % max = 20,min = 02095
2096 y_data = data_select_time_cum(:,2); % part 122097 y_data_norm = 1*((y_data - min(y_data))/(max(y_data) - ...
min(y_data))) + 0; % max = 1,min = 02098 y_data_scatter = 20*((y_data - min(y_data))/(max(y_data) ...
- min(y_data))) + 0; % max = 20,min = 02099
2100 z_data = data_nptest_elem_freq'; % part 14 ...(∆_wilcoxon_surv);
2101 % z_data = data_weight_cum(:,2); % part 152102 z_data_norm = 1*((z_data - min(z_data))/(max(z_data) - ...
min(z_data))) + 0; % max = 1,min = 02103 z_data_scatter = 01*((z_data - min(z_data))/(max(z_data) ...
- min(z_data))) + 1; % max = 1.1,min = 12104
2105 plotter_scat = ...[x_data_scatter,y_data_scatter,z_data_scatter];
2106 plotter_norm = [x_data_norm,y_data_norm,z_data_norm];2107
2108
2109 figure;2110 for i = 1:p-32111 x_data = plotter_scat(i,1);2112 y_data = plotter_scat(i,2);2113 r = plotter_scat(i,3);2114
2115 % for ang = 0:0.05:2*pi;2116 % xp = r*cos(ang);2117 % yp = r*sin(ang);2118 % plot(x+xp,y+yp,'Color',[0.1 0.1 ...
0.1],'LineWidth',2); hold on;
181
2119 % end2120
2121 % plot(1,1,'.r','MarkerSize',100)2122
2123 px = x_data-r;2124 py = y_data-r;2125 % h = rectangle('Position',[px py r*2 ...
r*2],'Curvature',[1,1],'FaceColor',[0 rand rand]);2126 % h = rectangle('Position',[px py r*2 ...
r*2],'Curvature',[1,1],'FaceColor',[0 ...(x_data_norm(i,1) + y_data_norm(i,1))/2 ...(x_data_norm(i,1) + y_data_norm(i,1))/2]);
2127 h = rectangle('Position',[px py r*2 ...r*2],'Curvature',[1,1],'FaceColor',[(x_data_norm(i,1) ...+ y_data_norm(i,1))/2 (x_data_norm(i,1) + ...y_data_norm(i,1))/2 (x_data_norm(i,1) + ...y_data_norm(i,1))/2]);
2128 % h = rectangle('Position',[px py r*2 ...r*2],'Curvature',[1,1],'FaceColor',[0 ...(x_data_norm(i,1) + y_data_norm(i,1) + ...r_data_norm(i,1))/3 (x_data_norm(i,1) + ...y_data_norm(i,1) + r_data_norm(i,1))/3]);
2129 daspect([1,1,1]);2130
2131 % text(x+px,y+py,['\fontsize12','\color[rgb].5 .0 ....0',num2str(data_comb_nptest_cum(i,1))],'FontSize',5); ...hold on; % 'FontWeight','bold'
2132
2133 hold on; grid on;2134 end2135
2136 for i = 1:p-32137 x_data = plotter_scat(i,1);2138 y_data = plotter_scat(i,2);2139
2140 text(x_data,y_data,['\fontsize12','\color[rgb]1 1 ...1',num2str(data_comb_nptest_cum(i,1))],'FontSize',5); ...hold on; % 'FontWeight','bold'
2141
2142 hold on; grid on;2143 end2144
2145 axis([min(plotter_scat(:,1))-5 max(plotter_scat(:,1))+5 ...min(plotter_scat(:,2))-5 max(plotter_scat(:,2))+5]);
2146 hold on; grid on;2147 % title('balloon scatter plot (a)','FontSize',15); hold on2148 xlabel('criterion a','FontSize',15); ylabel('criterion ...
b','FontSize',15); hold on;2149 set(gca,'FontSize',15);
182
2150
2151 figure;2152 scatter(x_data_norm,y_data_norm,fix(500*(z_data_norm+.1)),rand([p-3,3]),'fill','MarkerEdgeColor','k','LineWidth',0.5); ...
% http://www.mathworks.com/help/matlab/ref/scatter.html2153 axis([min(plotter_norm(:,1))-0.1 ...
max(plotter_norm(:,1))+0.1 min(plotter_norm(:,2))-0.1 ...max(plotter_norm(:,2))+0.1]);
2154 hold on; grid on;2155 scatter_labels = cellstr(num2str((1:p-3)'));2156 text(x_data_norm,y_data_norm,scatter_labels); % 0.1 ...
displacement so the text does not overlay the data points2157 title('balloon scatter plot (b)','FontSize',15); hold on;2158 xlabel('criterion 1','FontSize',15); ylabel('criterion ...
2','FontSize',15); hold on;2159 set(gca,'FontSize',15);2160
    % ---------------------------------------------------------------------

    figure;
    % bar(x_axis,1:p-3,plotter_norm,0.5,'stack'); % http://www.mathworks.com/help/matlab/ref/bar.html
    hold on; grid on;
    title('3D bar','FontSize',15); hold on;
    xlabel('criterion 1','FontSize',15); ylabel('criterion 2','FontSize',15); hold on;
    legend('criterion 1','criterion 2','criterion 3');
    set(gca,'FontSize',15);

    figure;
    % bar3(plotter_norm,0.5);
    bar3(1:p-3,plotter_norm,0.5); % http://www.mathworks.com/help/matlab/ref/bar3.html
    hold on; grid on;
    title('3D bar','FontSize',15); hold on;
    xlabel('criterion 1','FontSize',15); ylabel('criterion 2','FontSize',15); hold on;
    % axis([0 13 0 12 0 80])
    set(gca,'FontSize',15);
    figure;
    % labels = strcat();
    pie_labels = cellstr(num2str((1:p-3)'));
    % pie(y_data,ones(1,p-3),pie_labels); % http://www.mathworks.com/help/matlab/ref/pie.html
    hold on; grid on;

    % str = {'Item A: ';'Item B: ';'Item C: '}; % text
    % combinedstrings = strcat(str,percentValues); % text and percent values
    % oldExtents_cell = get(hText,'Extent'); % cell array
    % oldExtents = cell2mat(oldExtents_cell); % numeric array
    % set(hText,'String',combinedstrings);
    % newExtents_cell = get(hText,'Extent'); % cell array
    % newExtents = cell2mat(newExtents_cell); % numeric array
    % width_change = newExtents(:,3)-oldExtents(:,3);
    % signValues = sign(oldExtents(:,1));
    % offset = signValues.*(width_change/2);
    % textPositions_cell = get(hText,'Position'); % cell array
    % textPositions = cell2mat(textPositions_cell); % numeric array
    % textPositions(:,1) = textPositions(:,1) + offset; % add offset
    % set(hText,'Position',num2cell(textPositions,[3,2])) % set new position

    title('pie chart','FontSize',15); hold on;
    xlabel('criterion 1','FontSize',15); ylabel('criterion 2','FontSize',15); hold on;
    set(gca,'FontSize',15);

    figure;
    area(plotter_norm); % http://www.mathworks.com/help/matlab/ref/area.html
    hold on; grid on;
    title('area plot','FontSize',15); hold on;
    xlabel('criterion 1','FontSize',15); ylabel('criterion 2','FontSize',15); hold on;
    % legend(groups,'Location','NorthWest');
    legend('Location','NorthWest');
    set(gca,'FontSize',15);
    % set(h(1),'FaceColor',[0,0.25,0.25]);
    % set(h(2),'FaceColor',[0,0.5,0.5]);
    % set(h(3),'FaceColor',[0,0.75,0.75]);
    colormap winter;
    figure;
    scatter3(x_data_norm,y_data_norm,z_data_norm,100,rand([p-3,3]),'fill','MarkerEdgeColor','k','LineWidth',0.5); % http://www.mathworks.com/help/matlab/ref/scatter3.html
    hold on; grid on;
    scatter3_labels = cellstr(num2str((1:p-3)'));
    text(x_data_norm,y_data_norm,z_data_norm,scatter3_labels); % 0.1 displacement so the text does not overlay the data points
    title('3D scatter (a)','FontSize',15); hold on;
    xlabel('criterion 1','FontSize',15); ylabel('criterion 2','FontSize',15); zlabel('criterion 3','FontSize',15); hold on;
    set(gca,'FontSize',15);

    figure;
    scatter3(x_data_norm,y_data_norm,z_data_norm,100,[x_data_norm y_data_norm z_data_norm],'fill','MarkerEdgeColor','k','LineWidth',0.5); % http://www.mathworks.com/help/matlab/ref/scatter3.html
    hold on; grid on;
    scatter3_labels = cellstr(num2str((1:p-3)'));
    text(x_data_norm,y_data_norm,z_data_norm,scatter3_labels); % 0.1 displacement so the text does not overlay the data points
    title('3D scatter (b)','FontSize',15); hold on;
    xlabel('criterion 1','FontSize',15); ylabel('criterion 2','FontSize',15); zlabel('criterion 3','FontSize',15); hold on;
    set(gca,'FontSize',15);

    figure;
    scatter3(x_data_norm,y_data_norm,z_data_norm,fix(500*(z_data_norm+.1)),[zeros(p-3,1) (x_data_norm + y_data_norm + z_data_norm)/3 (x_data_norm + y_data_norm + z_data_norm)/3],'fill','MarkerEdgeColor','k','LineWidth',0.5); % http://www.mathworks.com/help/matlab/ref/scatter3.html
    hold on; grid on;
    scatter3_labels = cellstr(num2str((1:p-3)'));
    text(x_data_norm,y_data_norm,z_data_norm,scatter3_labels); % 0.1 displacement so the text does not overlay the data points
    title('3D scatter (c)','FontSize',15); hold on;
    xlabel('criterion 1','FontSize',15); ylabel('criterion 2','FontSize',15); zlabel('criterion 3','FontSize',15); hold on;
    set(gca,'FontSize',15);
    % make a color index for the ozone levels
    % nc = 16;
    % offset = 1;
    % c = response - min(response);
    % c = round((nc-1-2*offset)*c/max(c)+1+offset);

    figure;
    radarplot(plotter_norm);
    hold on; grid on;
    title('radarplot','FontSize',15); hold on;
    xlabel('criterion 1','FontSize',15); ylabel('criterion 2','FontSize',15); hold on;
    set(gca,'FontSize',15);

    figure;
    [x_data,y_data] = meshgrid(-3:.5:3,-3:.1:3); % http://www.mathworks.com/help/matlab/ref/ribbon.html
    z = peaks(x_data,y_data);
    ribbon(y_data,z);
    hold on; grid on;
    xlabel('X');
    ylabel('Y');
    zlabel('Z');
    title('ribbon','FontSize',15); hold on;
    xlabel('criterion 1','FontSize',15); ylabel('criterion 2','FontSize',15); hold on;
    set(gca,'FontSize',15);
    colormap hsv;

    % http://www.mathworks.com/matlabcentral/fileexchange/35304-matlab-plot-gallery-ribbon-plot/content/html/Ribbon_Plot.html
    % pi = 3.14; % unnecessary (and less precise): MATLAB provides pi as a builtin
    figure;
    radar_redius = 20;
    radar_segments = p-3;
    radar_labels = cellstr(num2str((1:p-3)'));
    for i = 1:radar_segments
        line([0 radar_redius*cos(i*2*pi/radar_segments)]', [0 radar_redius*sin(i*2*pi/radar_segments)]','Color',[0.5,0.5,0.5]);
        % text((radar_redius+2)*cos(i*2*pi/radar_segments),(radar_redius+2)*sin(i*2*pi/radar_segments),radar_labels(i,1));
        text((radar_redius+2)*cos(i*2*pi/radar_segments)-0.5,(radar_redius+2)*sin(i*2*pi/radar_segments),['\fontsize{12}','\color[rgb]{0.25 0.25 0.25}',num2str(data_comb_nptest_cum(i,1))],'FontSize',5); hold on; % 'FontWeight','bold'
    end
    hold on;
    for i = 1:radar_segments
        for j = 2:radar_redius
            line([j*cos(i*2*pi/radar_segments) j*cos((i+1)*2*pi/radar_segments)]', [j*sin(i*2*pi/radar_segments) j*sin((i+1)*2*pi/radar_segments)]','Color',[0.5,0.5,0.5]);
        end
    end
    hold on;
    for i = 1:radar_segments
        x_radar(i,:) = [radar_redius*x_data_norm(i,1)*cos(i*2*pi/radar_segments) radar_redius*x_data_norm(i,1)*sin(i*2*pi/radar_segments)];
        y_radar(i,:) = [radar_redius*y_data_norm(i,1)*cos(i*2*pi/radar_segments) radar_redius*y_data_norm(i,1)*sin(i*2*pi/radar_segments)];
        z_radar(i,:) = [radar_redius*z_data_norm(i,1)*cos(i*2*pi/radar_segments) radar_redius*z_data_norm(i,1)*sin(i*2*pi/radar_segments)];
    end
    fill(x_radar(:,1),x_radar(:,2),[1,0.4,0.6],'FaceAlpha',0.4);
    hold on;
    fill(y_radar(:,1),y_radar(:,2),[0,0.5,0.5],'FaceAlpha',0.4);
    hold on;
    fill(z_radar(:,1),z_radar(:,2),[1,0.9,0.5],'FaceAlpha',0.4);
    hold on;
    % set(gcf,'Color',[1,0.4,0.6])
    axis square;
    axis off;

end % if part_code(20) == 1
%% ========================================================================
%% ---------- part 21: fuzzy time score - random subset selection ---------
% =========================================================================
%%
if part_code(21) == 1

    % data_fuzzy = data_cal(1:n,4:p);
    m = 1000; % no. of randomly selected sets
    k = 3; % no. of variable combination out of all variables
    select = [];
    data_select_xxx = [];
    data_select_xxx(1:p-3,1:2) = 0;

    for i = 1:m
        select(i,:) = randsample(data_var_num(end,:),k,false);
    end

    sum_limit_effect_var = k;
    for i = 1:m % trials

        b_fuzzy = [];
        data_select = []; % constructing survival data from set #i
        data_select = data_fuzzy(:,select(i,1:k)); % for set #i
        b_fuzzy = regress(data_time,data_select);

        for j = 1:k
            data_select_xxx(select(i,j),1) = data_select_xxx(select(i,j),1) + b_fuzzy(j,1); % cumulative coefficient score for each variable
            data_select_xxx(select(i,j),2) = data_select_xxx(select(i,j),2) + abs((b_fuzzy(j,1))^2); % cumulative squared-coefficient score for each variable
        end
    end
    figure;
    bar(zscore(data_select_xxx(:,1)),'b','grouped');
    hold on; grid off;
    title('fuzzy time scores - randomization','FontSize',15); hold on;
    xlabel('variable','FontSize',15); ylabel('fuzzy time scores - randomization','FontSize',15); hold on;

    figure;
    bar( (data_select_xxx(:,2) - min(data_select_xxx(:,2)))/(max(data_select_xxx(:,2)) + min(data_select_xxx(:,2))),'FaceColor',[0.1,0.2,0.3],'EdgeColor',[0.1,0.1,0.1]);
    % bar((data_select_xxx(:,2)),'FaceColor',[0.4,0.4,0.4],'EdgeColor',[0.1,0.1,0.1]);
    hold on; grid off;
    % title('fuzzy time scores - randomization','FontSize',15); hold on;
    xlabel('variable','FontSize',15); ylabel('normalized coefficient score','FontSize',15); hold on;

    load 'figs';
    figure;
    s(1) = subplot(3,1,1);
    bar( (figs(:,1) - min(figs(:,1)))/(max(figs(:,1)) + min(figs(:,1))),'FaceColor',[0.3,0.3,0.3],'EdgeColor',[0.1,0.1,0.1]);
    hold on; grid off;
    xlabel('variable','FontSize',15); ylabel('normalized coefficient score','FontSize',15); hold on;
    s(2) = subplot(3,1,2);
    bar( (figs(:,2) - min(figs(:,2)))/(max(figs(:,2)) + min(figs(:,2))),'FaceColor',[0.3,0.3,0.3],'EdgeColor',[0.1,0.1,0.1]);
    hold on; grid off;
    xlabel('variable','FontSize',15); ylabel('normalized coefficient score','FontSize',15); hold on;
    s(3) = subplot(3,1,3);
    bar( (figs(:,5) - min(figs(:,5)))/(max(figs(:,5)) + min(figs(:,5))),'FaceColor',[0.3,0.3,0.3],'EdgeColor',[0.1,0.1,0.1]);
    hold on; grid off;
    xlabel('variable','FontSize',15); ylabel('normalized coefficient score','FontSize',15); hold on;

end % if part_code(21) == 1
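% The part-21 loop above (draw random k-variable subsets, regress the event
% times on each subset, and accumulate per-variable coefficient scores) can be
% cross-checked outside MATLAB. The following is an illustrative Python/NumPy
% re-sketch, not code from the dissertation; the names random_subset_scores,
% X and t are assumptions introduced here for the example.

```python
import numpy as np

def random_subset_scores(X, t, m=300, k=3, seed=0):
    """Accumulate least-squares coefficient scores over m random k-variable
    subsets, mirroring the part-21 loop: randsample -> regress -> cumulate.
    Row j of the result holds [sum of coefficients, sum of squared
    coefficients] over all subsets in which variable j was selected."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    scores = np.zeros((p, 2))
    for _ in range(m):
        sel = rng.choice(p, size=k, replace=False)          # randsample(..., k, false)
        b, *_ = np.linalg.lstsq(X[:, sel], t, rcond=None)   # regress(data_time, data_select)
        scores[sel, 0] += b                                 # cumulative coefficient score
        scores[sel, 1] += b ** 2                            # cumulative squared score
    return scores

# illustrative data: 50 subjects, 8 binary covariates, synthetic times
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(50, 8)).astype(float)
t = X @ np.array([2.0, 0, 0, 1.0, 0, 0, 0, 0]) + rng.normal(0, 0.1, size=50)
scores = random_subset_scores(X, t)
```

% As in the MATLAB listing, the first column would feed the z-scored bar plot
% and the second column the normalized squared-score bar plot.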
%% ========================================================================
%% ---------- part 22: fuzzy weighted standardized score method -----------
% =========================================================================
%%
if part_code(22) == 1

    m = 100; % no. of trials
    k = 3; % no. of cluster members
    var_num = p-3;
    % clus_nu = ((p-3) - mod(p-3,k))/k;
    clus_nu = fix((p-3)/k); % no. of clusters

    var_vector = [];
    var_vector = data_var_num(end,:);
    var_rand = [];

    data_nptest = [];
    data_nptest_cum = [];
    data_nptest_cum_unique = [];
    data_nptest_elem_freq = [];
    data_nptest_elem = [];
    for i = 1:m % trials

        var_rand = [];
        greedy(clus_nu) = struct('var_num',[],'data',[],'var',[],'test',[]);

        % var_rand = var_vector(randperm(length(var_vector))); % random sort of var numbers
        var_rand = randperm(p-3); % random sort of var numbers

        sum_limit_effect_var = 0;

        for j = 1:clus_nu
            greedy(j).var_num = var_rand(k*(j-1)+1:k*j);
        end

        % greedy(clus_nu).var_num = var_rand(k*(clus_nu-1)+1:end)

        for j = 1:clus_nu % cluster #j
            [greedy_var_num_row greedy_var_num_col] = size(greedy(j).var_num);
            kk = greedy_var_num_col; % equal to k for regular clusters; less than k if the remainder is put in an extra cluster
            greedy(j).data = []; % cluster #j
            greedy(j).data = data_var(:,greedy(j).var_num(1,1:kk)); % constructing cluster #j (var_num(1,1:k))
            greedy(j).data(:,kk+1) = sum(greedy(j).data,2);
            greedy(j).data(1:n,kk+2) = data_time;

            greedy(j).var = []; % constructing survival data from cluster #j based on element k+1 (for time t, select if at least "sum_limit_effect_var" variables are 1)
            greedy(j).var = greedy(j).data(find(greedy(j).data(:,kk+1)>sum_limit_effect_var),kk+2); % remove ineffective failure times from cluster #j
            greedy(j).var(:,2) = 1; % frequency of each failure time, in order to plot the kaplan-meier survival function

            [greedy(j).test(1,1),greedy(j).test(1,2)] = ranksum(data_time,greedy(j).var(:,1)); % wilcoxon rank sum test (col k+1 and k+2)
            greedy(j).test(1,3) = logrank(data_time,greedy(j).var(:,1)); % logrank test (col k+3)

            greedy_result(1,j) = j;
            greedy_result(2,j) = greedy(j).test(1,1);

        end

        [max_num,max_index] = max(greedy_result(2,:));

        var_rand = greedy(greedy_result(1,max_index)).var_num;
        data_nptest = [data_nptest; var_rand];

    end % for i = 1:m % trials
    [data_row data_col] = size(data_nptest);

    for i = 1:data_col % put all selected var numbers in a single vector
        data_nptest_cum((i-1)*data_row+1:i*data_row,1) = data_nptest(:,i);
    end

    data_nptest_cum_unique = unique(data_nptest_cum); % elements in data_nptest_cum
    [data_nptest_elem_freq,data_nptest_elem] = hist(data_nptest_cum,data_nptest_cum_unique); % elements in data_nptest_cum and their frequencies

    figure;
    bar(data_nptest_cum_unique,histc(data_nptest_cum,data_nptest_cum_unique),'FaceColor',[0.0,0.4,0.4],'EdgeColor',[0.0,0.1,0.1]); % elements in data_nptest_cum and their frequencies - method 1
    hold on; grid off;
    title('splitting semi-greedy clustering - wilcoxon rank or logrank frequencies (method 1)','FontSize',15); hold on;
    xlabel('variable','FontSize',15); ylabel('censoring','FontSize',15); hold on;

    figure;
    bar(data_nptest_elem,data_nptest_elem_freq,'FaceColor',[0.4,0.4,0.4],'EdgeColor',[0.1,0.1,0.1]); % elements in data_nptest_cum and their frequencies - method 2
    hold on; grid off;
    title('splitting semi-greedy clustering - wilcoxon rank or logrank frequencies (method 2)','FontSize',15); hold on;
    xlabel('variable','FontSize',15); ylabel('censoring','FontSize',15); hold on;

    % survival comparison of sets (instead of time)

end % if part_code(22) == 1
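% Parts 22-24 all end with the same tally: stack the winning variable sets
% from every trial into one vector, then count how often each variable index
% appears (the unique/hist/histc step feeding the bar plots). A minimal
% NumPy equivalent of that tally follows; selection_frequencies and sel are
% illustrative names introduced here, not identifiers from the listing.

```python
import numpy as np

def selection_frequencies(selected):
    """Pool the m winning k-variable sets (rows of `selected`) into one
    vector and count how often each variable index was chosen --
    the data_nptest_cum / hist step of the MATLAB listing."""
    pooled = selected.ravel()                        # data_nptest_cum
    elems, freqs = np.unique(pooled, return_counts=True)
    return elems, freqs

# three trials, each keeping a 3-variable set
sel = np.array([[1, 4, 7],
                [1, 2, 7],
                [3, 4, 7]])
elems, freqs = selection_frequencies(sel)
# elems -> [1 2 3 4 7], freqs -> [2 1 1 2 3]
```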
%% ========================================================================
%% ---------- part 23: M-III1 ---------------------------------------------
% =========================================================================
%%
if part_code(23) == 1

    m = 100; % no. of trials
    k = 4; % no. of cluster members
    kmean_clus_num = 5;
    var_num = p-3;
    % clus_nu = ((p-3) - mod(p-3,k))/k;
    clus_nu = fix((p-3)/k); % no. of clusters

    data_trans = [];
    kmean_result = [];
    potentil_function = [];
    cost_function = [];

    var_rand = [];
    var_vector = [];
    var_vector = data_var_num(end,:);
    for i = 1:m % trials

        var_rand = [];
        greedy(i,clus_nu) = struct('var_num',[],'data',[]);
        data_trans = [];
        idx = [];
        C = [];
        sumd = [];
        D = [];
        kmean_result = [];
        potentil_function = [];
        cost_function(i,1) = 0;

        % var_rand = var_vector(randperm(length(var_vector))); % random sort of var numbers
        var_rand = randperm(p-3); % random sort of var numbers

        for j = 1:clus_nu
            greedy(i,j).var_num = var_rand(k*(j-1)+1:k*j);
        end

        % greedy(clus_nu).var_num = var_rand(k*(clus_nu-1)+1:end)

        for j = 1:clus_nu % cluster #j
            [greedy_var_num_row greedy_var_num_col] = size(greedy(i,j).var_num);
            kk = greedy_var_num_col; % equal to k for regular clusters; less than k if the remainder is put in an extra cluster
            greedy(i,j).data = []; % cluster #j
            greedy(i,j).data = data_fuzzy(:,greedy(i,j).var_num(1,1:kk)); % constructing cluster #j (var_num(1,1:k))
            greedy(i,j).data(:,kk+1) = mean(greedy(i,j).data,2);
            greedy(i,j).data(1:n,kk+2) = data_time;

            data_trans(:,j) = greedy(i,j).data(:,kk+1);
        end

        [idx,C,sumd,D] = kmeans(data_trans,kmean_clus_num);

        % opts = statset('Display','final');
        % [cidx_raw,ctrs_log] = kmeans(data_trans,5,'Distance','cityblock','Replicates',5,'Options',opts);

        kmean_result = horzcat(data_time,idx,data_trans);

        for j = 1:kmean_clus_num
            cluster_kmean = kmean_result(find(kmean_result(:,2)==j),:);
            potentil_function(j,1) = mean(cluster_kmean(:,1));
            potentil_function(j,2) = mean(cluster_kmean(:,2));
        end

        for ii = 1:kmean_clus_num-1
            for jj = ii+1:kmean_clus_num
                cost_function(i,1) = cost_function(i,1) + abs(potentil_function(ii,1)-potentil_function(jj,1));
                cost_function(i,2) = i;
            end
        end

    end % for i = 1:m % trials
    cost_function_sort = sortrows(cost_function,1);
    largest_cost_trial = cost_function_sort(end,2);

    greedy(largest_cost_trial,:).var_num

    [data_row data_col] = size(data_nptest);

    for i = 1:data_col % put all selected var numbers in a single vector
        data_nptest_cum((i-1)*data_row+1:i*data_row,1) = data_nptest(:,i);
    end

    data_nptest_cum_unique = unique(data_nptest_cum); % elements in data_nptest_cum
    [data_nptest_elem_freq,data_nptest_elem] = hist(data_nptest_cum,data_nptest_cum_unique); % elements in data_nptest_cum and their frequencies

    figure;
    bar(data_nptest_cum_unique,histc(data_nptest_cum,data_nptest_cum_unique),'FaceColor',[0.0,0.4,0.4],'EdgeColor',[0.0,0.1,0.1]); % elements in data_nptest_cum and their frequencies - method 1
    hold on; grid off;
    title('splitting semi-greedy clustering - wilcoxon rank or logrank frequencies (method 1)','FontSize',15); hold on;
    xlabel('variable','FontSize',15); ylabel('censoring','FontSize',15); hold on;

    figure;
    bar(data_nptest_elem,data_nptest_elem_freq,'FaceColor',[0.4,0.4,0.4],'EdgeColor',[0.1,0.1,0.1]); % elements in data_nptest_cum and their frequencies - method 2
    hold on; grid off;
    title('splitting semi-greedy clustering - wilcoxon rank or logrank frequencies (method 2)','FontSize',15); hold on;
    xlabel('variable','FontSize',15); ylabel('censoring','FontSize',15); hold on;

    % survival comparison of sets (instead of time)

end % if part_code(23) == 1
%% ========================================================================
%% ---------- part 24: M-III2 ---------------------------------------------
% =========================================================================
%%
if part_code(24) == 1

    m = 10; % no. of trials
    k = 4; % no. of cluster members
    kmean_clus_num = 5;
    var_num = p-3;
    % clus_nu = ((p-3) - mod(p-3,k))/k;
    clus_nu = fix((p-3)/k); % no. of clusters

    data_trans = [];
    kmean_result = [];
    potentil_function = [];
    cost_function = [];

    var_rand = [];
    var_vector = [];
    var_vector = data_var_num(end,:);

    for i = 1:m % trials

        var_rand = [];
        greedy(i,clus_nu) = struct('var_num',[],'data',[]);
        data_trans = [];
        idx = [];
        C = [];
        sumd = [];
        D = [];
        kmean_result = [];
        potentil_function = [];
        cost_function(i,1) = 0;
        % var_rand = var_vector(randperm(length(var_vector))); % random sort of var numbers
        var_rand = randperm(p-3); % random sort of var numbers

        for j = 1:clus_nu
            greedy(i,j).var_num = var_rand(k*(j-1)+1:k*j);
        end

        % greedy(clus_nu).var_num = var_rand(k*(clus_nu-1)+1:end)

        for j = 1:clus_nu % cluster #j
            [greedy_var_num_row greedy_var_num_col] = size(greedy(i,j).var_num);
            kk = greedy_var_num_col; % equal to k for regular clusters; less than k if the remainder is put in an extra cluster
            greedy(i,j).data = []; % cluster #j
            greedy(i,j).data = data_fuzzy(:,greedy(i,j).var_num(1,1:kk)); % constructing cluster #j (var_num(1,1:k))
            greedy(i,j).data(:,kk+1) = mean(greedy(i,j).data,2);
            greedy(i,j).data(1:n,kk+2) = data_time;

            data_trans(:,j) = greedy(i,j).data(:,kk+1);
        end

        [idx,C,sumd,D] = kmeans(data_trans,kmean_clus_num);

        % opts = statset('Display','final');
        % [cidx_raw,ctrs_log] = kmeans(data_trans,5,'Distance','cityblock','Replicates',5,'Options',opts);

        kmean_result = horzcat(data_time,idx,data_trans);

        for j = 1:kmean_clus_num
            cluster_kmean = kmean_result(find(kmean_result(:,2)==j),:);
            potentil_function(j,1) = mean(cluster_kmean(:,1));
            potentil_function(j,2) = mean(cluster_kmean(:,2));
        end

        for ii = 1:kmean_clus_num-1
            for jj = ii+1:kmean_clus_num
                cost_function(i,1) = cost_function(i,1) + abs(potentil_function(ii,1)-potentil_function(jj,1));
                cost_function(i,2) = i;
            end
        end

    end % for i = 1:m % trials
    cost_function_sort = sortrows(cost_function,1);
    largest_cost_trial = cost_function_sort(end,2);

    greedy(largest_cost_trial,:).var_num

    [data_row data_col] = size(data_nptest);

    for i = 1:data_col % put all selected var numbers in a single vector
        data_nptest_cum((i-1)*data_row+1:i*data_row,1) = data_nptest(:,i);
    end

    data_nptest_cum_unique = unique(data_nptest_cum); % elements in data_nptest_cum
    [data_nptest_elem_freq,data_nptest_elem] = hist(data_nptest_cum,data_nptest_cum_unique); % elements in data_nptest_cum and their frequencies

    figure;
    bar(data_nptest_cum_unique,histc(data_nptest_cum,data_nptest_cum_unique),'FaceColor',[0.0,0.4,0.4],'EdgeColor',[0.0,0.1,0.1]); % elements in data_nptest_cum and their frequencies - method 1
    hold on; grid off;
    title('splitting semi-greedy clustering - wilcoxon rank or logrank frequencies (method 1)','FontSize',15); hold on;
    xlabel('variable','FontSize',15); ylabel('censoring','FontSize',15); hold on;

    figure;
    bar(data_nptest_elem,data_nptest_elem_freq,'FaceColor',[0.4,0.4,0.4],'EdgeColor',[0.1,0.1,0.1]); % elements in data_nptest_cum and their frequencies - method 2
    hold on; grid off;
    title('splitting semi-greedy clustering - wilcoxon rank or logrank frequencies (method 2)','FontSize',15); hold on;
    xlabel('variable','FontSize',15); ylabel('censoring','FontSize',15); hold on;

    % survival comparison of sets (instead of time)

end % if part_code(24) == 1
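% In parts 23 and 24, each trial is scored by summing, over all pairs of
% k-means clusters, the absolute difference of the clusters' mean event
% times; the trial with the largest sum (best-separated clusters) is kept.
% A small Python sketch of that cost function follows; pairwise_mean_gap,
% times and labels are illustrative names introduced for the example.

```python
import numpy as np

def pairwise_mean_gap(times, labels):
    """Cost function from parts 23-24: sum over all cluster pairs of the
    absolute difference of mean event times (larger = better separation)."""
    means = [times[labels == c].mean() for c in np.unique(labels)]
    cost = 0.0
    for i in range(len(means) - 1):          # mirrors the ii/jj double loop
        for j in range(i + 1, len(means)):
            cost += abs(means[i] - means[j])
    return cost

# three clusters with mean event times 1.1, 5.1 and 9.0
times = np.array([1.0, 1.2, 5.0, 5.2, 9.0])
labels = np.array([0, 0, 1, 1, 2])
cost = pairwise_mean_gap(times, labels)
# |1.1-5.1| + |1.1-9.0| + |5.1-9.0| = 15.8
```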
%% ========================================================================
%% ---------- end of parts ------------------------------------------------
% =========================================================================