
MNRAS 000, 1-16 (2019). Preprint 17 July 2019. Compiled using MNRAS LaTeX style file v3.0.

Comparing Multi-class, Binary and Hierarchical Machine Learning Classification schemes for variable stars

Zafiirah Hosenie,1⋆ Robert Lyon,1 Benjamin Stappers,1 Arrykrishna Mootoovaloo2
1 Jodrell Bank Centre for Astrophysics, School of Physics and Astronomy, The University of Manchester, Manchester M13 9PL, UK.
2 Imperial Centre for Inference and Cosmology (ICIC), Imperial College, Blackett Laboratory, Prince Consort Road, London SW7 2AZ, UK.

    Accepted 2019 July 15. Received 2019 July 15; in original form 2019 March 06

ABSTRACT
Upcoming synoptic surveys are set to generate an unprecedented amount of data. This requires an automatic framework that can quickly and efficiently provide classification labels for several new object classification challenges. Using data describing 11 types of variable stars from the Catalina Real-Time Transient Surveys (CRTS), we illustrate how to capture the most important information from computed features and describe detailed methods of how to robustly use Information Theory for feature selection and evaluation. We apply three Machine Learning (ML) algorithms and demonstrate how to optimize these classifiers via cross-validation techniques. For the CRTS dataset, we find that the Random Forest (RF) classifier performs best in terms of balanced-accuracy and geometric means. We demonstrate substantially improved classification results by converting the multi-class problem into a binary classification task, achieving a balanced-accuracy rate of ∼99 per cent for the classification of δ-Scuti and Anomalous Cepheids (ACEP). Additionally, we describe how classification performance can be improved via converting a 'flat multi-class' problem into a hierarchical taxonomy. We develop a new hierarchical structure and propose a new set of classification features, enabling the accurate identification of subtypes of Cepheids, RR Lyrae and eclipsing binary stars in CRTS data.

Key words: stars: variables - general; methods: data analysis; astronomical instrumentation, methods, and techniques.

    1 INTRODUCTION

Astronomy has experienced an increase in the volume, quality and complexity of datasets produced during numerical simulations and surveys. One factor that contributes to the data avalanche is the new generation of synoptic sky surveys, for example, the Catalina Real-Time Transient Surveys (CRTS; Drake et al. 2017). In addition, the Large Synoptic Survey Telescope (LSST; Ivezic et al. 2008), which is now on the horizon, will produce ∼15 terabytes of raw data per night (Juric et al. 2015). However, despite this data deluge, source variability is often still visually inspected to detect new promising candidates/variable stars. Visual inspection does have utility for detection and classification. Human experts can extract new useful information despite unevenly sampled data sets and also have the ability to distinguish noisy data from data exhibiting interesting behaviour/characteristics. They can also incorporate complex contextual information into their decision making. However,

⋆ E-mail: [email protected]

the efficacy of the manual approach decreases as the volume of data grows exponentially, as will be the case for the next generation of surveys. Visual inspection becomes inconsistent; consequently, mistakes are made, and rare/interesting objects can be missed.

To address this problem, Machine Learning (ML) has been applied to variable-star classification in multiple time-series datasets (see Belokurov et al. 2003; Willemsen & Eyer 2007). In ML, variable stars are represented by features: independent measures that contain information useful for differentiating variable stars into their respective classes. Several developments have therefore been made towards determining the best methods and features for describing variable stars, including the Lomb-Scargle periodogram (Lomb 1976; Scargle 1982), Bayesian Evidence Estimation (Gregory & Loredo 1992), as well as hybrid methods (Saha & Vivas 2017). In addition, Eyer & Blake (2005) analysed the small sharp features of light curves and included them as input features to a Naïve Bayes classifier (Zhang 2004), while Djorgovski et al. (2016) developed an automatic framework to detect and classify transient events and variable stars.



[Figure 1 here: folded light curves (magnitude versus phase, 0-2) for one example of each class: RRab (J000031.7-412854), RRc (J000044.8-430758), RRd (J000956.0-242445), Blazhko (J001108.0-593330), Contact & Semi-Detached EB (J000025.8-393651), Detached EB (J000111.5-223745), Rotational (J000733.1-294903), LPV (J001234.2-225517), δ-Scuti (J012028.9-292610), ACEP (J003041.3-441620) and Cep-II (J004302.8-533428).]

Figure 1. Examples of folded light curves from the CRTS for the various types of variable stars considered in our analyses.

They used a subset of the CRTS data to perform classification between two types of variable stars (W UMa and RR Lyrae) and obtained completeness rates of ∼96-97 per cent.

Kim & Bailer-Jones (2016) developed the UPSILON package to classify periodic variable stars using 16 features extracted from light curves, which achieves good results. Mahabal et al. (2017) developed a classifier based on the Convolutional Neural Network (CNN) model using labelled datasets of periodic variables from the CRTS (Drake et al. 2009; Djorgovski et al. 2011; Mahabal et al. 2012; Djorgovski et al. 2016). They transformed a light curve (time series) into a two-dimensional mapping representation (dm-dt), based on the changes in magnitude (dm) over the time difference (dt). Using multi-class classification, their algorithm achieved an accuracy of ∼83 per cent. Narayan et al. (2018) developed an ML approach to classify variable versus transient stars. Similarly, they performed a multi-class classification of combined variable stars & transients, and a 'purity-driven' sub-categorisation of the transient class using multi-band optical photometry. Revsbech et al. (2018) used a data augmentation technique to mitigate the effects of bias in their data by generating additional training data using Gaussian Processes (GPs). They used a diffusion map method that calculates a pair-wise distance matrix and outputs diffusion map coefficients of the light curves. These coefficients act as feature inputs to a Random Forest (RF) classifier used to help identify Type Ia supernovae.

We found that it is fundamentally important to develop accurate and robust automated classification methods for this problem using machine learning and other statistical approaches. This paper describes a new automatic classification pipeline for the classification of variable stars via application to archival data. To our knowledge, this is the first time the southern CRTS (Drake et al. 2017) data set has been used to build/evaluate an automatic classification system.

Similar work has been completed in recent years (Kim & Bailer-Jones 2016; Mahabal et al. 2017; Narayan et al. 2018), though the features used for learning are rarely evaluated in a statistically rigorous way. We found that using a large set of features does not imply higher classification metrics. We therefore perform an in-depth analysis of ML features to understand their information content, and determine which give rise to the best classification performance. We utilize various visualization techniques and the tools of Information Theory to achieve this.

Based on our analyses we find that accurate variable star classification is possible with just seven features - much fewer than in other works. In addition, we show that this classification problem cannot be solved with a 'flat' multi-class classification approach, as the data is inherently imbalanced. To partially alleviate the 'imbalanced learning problem' (Last et al. 2017), we developed an approach inspired by earlier work in this area (Richards et al. 2011). This involved converting a standard multi-class problem into a hierarchical classification problem, by aggregating sub-classes into super-classes. This results in improved performance on rare class examples typically misclassified by multi-class methods. We adopt a similar methodology to Richards et al. (2011), however we i) propose a different hierarchical classification structure, ii) use a different feature analysis/selection methodology resulting in different feature choices, iii) apply hyper-parameter optimisation to build optimal classification models, and finally iv) apply the resulting approach to CRTS data.

The outline of this paper is as follows. In §2, we provide a brief description of the dataset used, and in §3 we present the feature generation techniques we employ; in §4 we explain how we build the classification pipeline. In §5 we apply state-of-the-art feature visualisation techniques to visualise how separable our features are before performing a multi-class classification. In §6, we provide an in-depth feature evaluation to determine the usefulness of our extracted features before performing a binary classification. In §7, we present a hierarchical taxonomy for classification and discuss our results; finally, we summarise our results and conclusions in §8.

    2 DATA

We use the publicly available CRTS (Drake et al. 2017) dataset, which covers the sky at declinations between −20° and −75°. The sources in the dataset have median magnitudes in the range 11 < V < 19.5. The dataset contains different forms of periodic variable stars; these are stars that undergo regular changes in brightness every few hours or within a few days or weeks. The periodic variable stars in the dataset can generally be classified into three broad classes: namely eclipsing, pulsating, and rotational. The classes can be further divided into sub-types; for example, pulsating stars consist of δ Scutis, RR Lyrae, Cepheids, and the Long Period Variables (LPVs) group, which includes both semi-regular variables and Mira variable stars. The RR Lyrae class consists of RRab (fundamental mode), RRc (first overtone mode), and RRd (multimode) stars. However, many RR Lyrae stars are known to exhibit the Blazhko effect (long-term modulation; Blazhko 1907). In addition, the Cepheids include type-II Cepheids (Cep-II), Anomalous Cepheids (ACEPs), and Classical Cepheids. The eclipsing binary variables in the data are divided into detached binaries (EA) and contact plus semi-detached binaries (Ecl). The rotational class consists of variable stars including the ellipsoidal variables (ELL) and spotted RS Canum Venaticorum (RS CVn) systems. A more detailed overview of the data set is given in Drake et al. (2017), and Catelan & Smith (2015) give a more detailed overview of the properties of the various types of pulsating stars.

The dataset contains 37,745 periodic variable stars (Catalina Surveys Data Release 2, CSDR2; see footnote 1). For our analysis, we use a sample of 37,437 out of the 37,745 stars from the CSDR2. We have excluded Type 11: Miscellaneous variable stars (periodic stars that were difficult to classify, as presented in Drake et al. (2017)) and Type 13: LMC-Cep-I, which as a group consists of only 10 examples.

1 Catalina Surveys Data Release 2: http://nesssi.cacr.caltech.edu/DataRelease/VarcatS.html

[Figure 2 here: bar chart of the number of samples per class: 1. RRab (4325), 2. RRc (3752), 3. RRd (502), 4. Blazhko (171), 5. Ecl (4509), 6. EA (4509), 7. Rotational (3636), 8. LPV (1286), 9. dScuti (147), 10. ACEP (153), 12. Cep-II (153).]

Figure 2. Class distributions for the CSDR2 datasets. We downsample Type 5: semi-detached binary stars to 4,509 samples to prevent larger classes from dominating the training sets. The excluded samples (∼14,294) for Type 5: Ecl are then included in the test set for prediction. These distributions highlight the remaining imbalance in the datasets.

We remove the smallest classes as there are too few samples to characterise those classes; we also know that classifiers given such data will struggle to categorise them accurately due to the imbalanced learning problem (Last et al. 2017). Furthermore, we downsample Type 5: semi-detached binary stars to 4,509 samples, as this class of object originally consisted of 18,803 samples. We perform this downsampling to prevent the large class from dominating the training sets, which could otherwise potentially bias a classifier. The excluded samples of Type 5 are then included in the test set. Fig. 1 shows examples of folded light curves for each class under consideration. We also present the number of samples considered for the 11 different types of variable stars in Fig. 2. Note that ∼14,294 samples of Type 5: Ecl, unused during training, were eventually used in the test set.
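This downsampling step is straightforward to script. The sketch below shows one way to do it, assuming the catalogue is held in a pandas DataFrame with a class_type column; all names here are our own illustration, not the paper's code.

```python
import pandas as pd

def downsample_class(df, label_col, class_id, n_keep, seed=42):
    """Randomly keep n_keep examples of one class for training;
    the surplus (e.g. the ~14,294 extra Type 5: Ecl stars) is
    returned separately so it can be routed into the test set."""
    cls = df[df[label_col] == class_id]
    kept = cls.sample(n=n_keep, random_state=seed)
    surplus = cls.drop(kept.index)
    rest = df[df[label_col] != class_id]
    return pd.concat([rest, kept]), surplus

# e.g.: train_pool, extra_ecl = downsample_class(catalog, "class_type", 5, 4509)
```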

    3 FEATURE GENERATION

In general, machine learning algorithms use training data to build a mathematical model, where the training data is comprised of feature data. These features may have varying utility; information-rich features are desirable as they can be used to build more accurate classification systems. In this work, we generate statistical features from the light curves to characterize and distinguish different variability classes. Let $S = \{X_1, \dots, X_n\}$ represent the set of variable star data available to us; then $X_i$ is an individual star represented by variables known as features. An arbitrary feature of star $X_i$ is denoted by $X_i^j$, where $j = 1, \dots, l$. Each variable star has an associated label $y$ such that $y \in Y = \{y_1, \dots, y_k\}$. For multi-class scenarios the possible labels assignable to variable stars vary as $1 \leq y \leq 12$, but since we do not consider Type 11 examples for reasons given at the end of §2, we note that $y \neq 11$. Meanwhile, for binary classification scenarios, we consider binary labels $y \in Y = \{0, 1\}$.

The goal is to build a machine learning algorithm that learns to classify variable stars described by features, from a labelled input vector, also known as the training set, $X_{\rm Train}$. The training set consists of pairs such that $X_{\rm Train} = \{(X_1, y_1), \dots, (X_n, y_n)\}$. The learnt mapping function between input feature vectors and labels in $X_{\rm Train}$ can then be utilised to label new unseen stars in $X_{\rm Test}$.




Table 1. The seven features used as inputs to our classification scheme. Six features, based on simple statistics, are extracted directly from light curves using FATS. Note that the period feature is obtained from the Drake et al. (2017) catalog.

Mean (symbol $\mu$): $\mu = \frac{1}{N}\sum_{i=1}^{N} m_i$, where $m_i$ is the magnitude and $N$ is the number of data points.

Standard deviation (symbol $\sigma$): $\sigma = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(m_i - \mu)^2}$.

Skewness: $\frac{N}{(N-1)(N-2)}\sum_{i=1}^{N}\left(\frac{m_i - \mu}{\sigma}\right)^{3}$.

Kurtosis (symbol kurt): $\mathrm{kurt} = \frac{N(N+1)}{(N-1)(N-2)(N-3)}\sum_{i=1}^{N}\left(\frac{m_i - \mu}{\sigma}\right)^{4} - \frac{3(N-1)^{2}}{(N-2)(N-3)}$.

Mean-variance (symbol $\sigma/\mu$): a simple variability index, defined as the ratio of the standard deviation, $\sigma$, to the mean magnitude, $\mu$.

Period (symbol $T$): Drake et al. (2017) used the Lomb-Scargle (L-S) periodogram analysis together with an Adaptive Fourier Decomposition (AFD) method to calculate the period of unevenly sampled data. More information about this feature can be found in Drake et al. (2017).

Amplitude (symbol Amp): half of the difference between the median of the maximum 5% and the median of the minimum 5% of the magnitudes.


In this work, we focus mainly on using the statistical properties of the data, with no preconceived notions of their suitability or expressiveness as input features to our ML algorithms. As a result we focus on 7 features, of which 6 are intrinsic statistical properties relating to location (mean magnitude), scale (standard deviation), variability (mean-variance), morphology (skew, kurtosis, amplitude), and time (period). These features are highly interpretable, and robust against bias. Note that we remove data points from light curves that are 3σ above or below the mean magnitude, where σ is the standard deviation; this is an important step to remove any outliers in the data. This cleaning does not alter the light curves significantly, as it removes less than 1 per cent of their data points.

Afterwards, we used the FATS (Nun et al. 2015) Python library (see footnote 2) to extract these features. FATS takes as input the unfolded light curves and outputs various statistical features: the mean, standard deviation, skew, kurtosis, mean-variance, and amplitude. We also incorporate the period of each star given in the catalog as a feature for our ML algorithms. The description of the input features used for classification is listed in Table 1. Practitioners should be cautious when applying the mean magnitude as a feature in combination with data obtained at another telescope: it has the potential to bias a training set against fainter/brighter sources. We note that adding the telescope label as a feature may overcome this issue, but we leave that to future work.
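For concreteness, the sketch below reproduces the 3σ cleaning and the Table 1 statistics with NumPy/SciPy; FATS returns equivalent values, and the function name and array layout here are our own illustration.

```python
import numpy as np
from scipy import stats

def extract_features(mag, period):
    """Compute the seven Table 1 features for one light curve,
    given its magnitudes (1-D array) and catalogue period."""
    # Remove outliers: drop points more than 3 sigma from the mean magnitude
    mu, sigma = mag.mean(), mag.std(ddof=1)
    mag = mag[np.abs(mag - mu) < 3.0 * sigma]

    n = len(mag)
    mu, sigma = mag.mean(), mag.std(ddof=1)
    skew = stats.skew(mag, bias=False)        # sample skewness, as in Table 1
    kurt = stats.kurtosis(mag, bias=False)    # sample excess kurtosis, as in Table 1
    # Amplitude: half the spread between the medians of the brightest
    # and faintest 5 per cent of points
    srt = np.sort(mag)
    k = max(1, int(0.05 * n))
    amp = 0.5 * (np.median(srt[-k:]) - np.median(srt[:k]))
    return np.array([mu, sigma, skew, kurt, sigma / mu, period, amp])
```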

One important aspect of the training process used to build a classification model is data pre-processing. Some ML algorithms, e.g. functions/classifiers that calculate the distance between data points, will not work properly without normalisation or standardisation, since the range of values

2 FATS: Feature Analysis for Time Series, http://isadoranun.github.io/tsfeat/FeaturesDocumentation.html

of the features/raw data varies widely. We therefore employ a normalisation method, Ŝ, to standardize the feature data such that all values in the feature space are scaled between 0 and 1.

    4 CLASSIFICATION PIPELINE

In this section, we describe the process used to perform the classification of variable stars: training, hyper-parameter optimisation and prediction. We first extract features and then split the feature data into training (70 per cent) and test (30 per cent) sets. The training data is used to train the model, while the test data is used to evaluate the performance of the trained model. Once the data is split, we perform a simple normalization where each feature $X_i^j$ is divided by its maximum, $\max(X_i^j)$ (see Eq. 1). The goal of normalisation is to ensure that all features use a common scale. This is beneficial for ML algorithms that are sensitive to feature distributions, for instance, distance-based algorithms that require Euclidean distance computation (e.g. k-Nearest Neighbours, kNN). Some models (e.g. Decision Tree (DT), Random Forest (RF)) are less sensitive to feature scaling. However, it is good practice to standardize when comparing between classification systems, to rule out potential sources of disparity in our results. Note that in this context, we found that this simple normalisation was enough to yield good performance. For $X_{\rm Test}$, the features are rescaled using $\max(X_i^{j,{\rm Train}})$:

$$X_{\rm Train} = \left|\frac{X_i^{\rm Train}}{\max\left(X_i^{\rm Train}\right)}\right|, \qquad X_{\rm Test} = \left|\frac{X_i^{\rm Test}}{\max\left(X_i^{\rm Train}\right)}\right|. \quad (1)$$

Afterwards, we apply some feature evaluation strategies to check whether the features show some separability. Then, for our classification purposes, we apply three different classification algorithms: DT, RF and kNN (Buturovic 1993), implemented with Scikit-Learn (Pedregosa et al. 2011).
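The hyper-parameter optimisation stage of the pipeline (a randomized grid search with cross-validation, Fig. 3) can be set up in Scikit-Learn as below; the search ranges are illustrative assumptions, since the paper does not list its grid.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Illustrative search space for the RF; DT and kNN are tuned analogously.
param_dist = {
    "n_estimators": [100, 300, 500, 700],
    "max_depth": [None, 10, 20, 40],
    "min_samples_split": [2, 5, 10],
    "max_features": ["sqrt", "log2"],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=20,                      # number of random draws from the grid
    scoring="balanced_accuracy",
    cv=5,                           # cross-validation folds (see §4.1)
)
# search.fit(X_train, y_train); clf = search.best_estimator_
```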




[Figure 3 here: flowchart of the classification pipeline. Input data: variable-star light curves (11 types). Preprocessing and feature selection: feature extraction with FATS; data split into training and test sets; feature normalisation. Optimization: visualisation (density plots, box and whisker plots, t-SNE); feature evaluation (correlation test, MI & JMI using information theory); hyper-parameter optimization (randomized grid search with cross-validation). Evaluation: final classification with the best hyper-parameters used in the RF.]


[Figure 4 here: per-class one-dimensional density estimates of each normalised feature.]

Figure 4. One-dimensional density estimates of the selected features by class. The features considered are (a) Skew, (b) Mean, (c) Sigma, (d) Kurtosis, (e) Period, (f) Amplitude, (g) Mean Variance.

The trained model is then evaluated on the test set (30 per cent), $X_{\rm Test}$, using the various evaluation metrics described in Appendix A. The performance of our pipeline is evaluated on balanced-accuracy, the Geometric-mean (G-mean) score, F1-score, recall and standard confusion matrices. The classification pipeline is summarised in Fig. 3.
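Assuming a fitted classifier clf, these metrics can be computed as below; we use imbalanced-learn's geometric_mean_score for the G-mean (the geometric mean of per-class recall), which is one standard implementation rather than the paper's own code.

```python
from sklearn.metrics import (balanced_accuracy_score, confusion_matrix,
                             f1_score, recall_score)
from imblearn.metrics import geometric_mean_score  # imbalanced-learn package

y_pred = clf.predict(X_test)
print("balanced accuracy:", balanced_accuracy_score(y_test, y_pred))
print("G-mean:", geometric_mean_score(y_test, y_pred))
print("F1 per class:", f1_score(y_test, y_pred, average=None))
print("recall per class:", recall_score(y_test, y_pred, average=None))
print(confusion_matrix(y_test, y_pred, normalize="true"))  # as plotted in Fig. 6
```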

4.1 K-Fold cross-validation

When using machine learning algorithms, another major problem is an overoptimistic result, i.e. output results that are too good to be true on training data, due to over- or under-fitting. Over-fitting mostly happens when we perform training and evaluation on the same data. Therefore, classification algorithms must be tested on independent data to avoid this problem. One method for avoiding this involves splitting



[Figure 5 here: two t-SNE scatter plots, (a) for the large-sample classes and (b) for the small-sample classes, with t-SNE Feature 1 and t-SNE Feature 2 on the axes.]

Figure 5. (a) shows the t-distributed stochastic neighbour embedding (t-SNE) visualization for the feature set after normalisation with the large sample data (RRab, EA, Rotational and LPV). We note that the classes are quite well-separated in the embedding space. (b) illustrates the t-SNE visualization for the feature set after normalisation with the small sample size data (Blazhko, δ-Scuti, ACEP and Cep-II). No distinct separation is seen within the small sample dataset.

the data into training and testing sets, as explained in §3. Another common method for splitting datasets for evaluation is known as K-fold cross-validation: the training dataset $S$ is split randomly into $K$ mutually exclusive subsets $(S_1, S_2, \dots, S_K)$ of nearly the same size. The classification algorithm is trained and tested $K$ times. For each time step $t \in \{1, 2, \dots, K\}$, the algorithm is trained on $K-1$ folds and one fold $S_t$ is used for validation. In addition, a stratification of the data is applied such that for each of the $K$ folds, the data are arranged to ensure each fold preserves the percentage of samples for each class in the dataset at large. The overall balanced-accuracy of an algorithm trained/tested via cross-validation is simply the average of the $K$ balanced-accuracy measures obtained after each time step.
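A compact sketch of this procedure with Scikit-Learn's StratifiedKFold, averaging the K per-fold balanced accuracies (the helper name is ours; X and y are NumPy arrays):

```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import StratifiedKFold

def cv_balanced_accuracy(clf, X, y, k=5, seed=42):
    """Stratified K-fold CV: each fold preserves the per-class proportions;
    the overall score is the mean of the K per-fold balanced accuracies."""
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    scores = []
    for train_idx, val_idx in skf.split(X, y):
        model = clone(clf).fit(X[train_idx], y[train_idx])
        scores.append(balanced_accuracy_score(y[val_idx],
                                              model.predict(X[val_idx])))
    return np.mean(scores)
```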

    5 MULTI-CLASS CLASSIFICATION

Our main goal is to perform a multi-class classification using the classification pipeline previously described. We first apply some feature evaluation strategies to check whether the features extracted in §3 are good for classification. The most common methods that characterise 'good' features look for the presence of linear correlations between the features and the class labels, and/or indirectly measure feature utility by using classification performance as a proxy (e.g. Bates et al. 2011): if performance is good, we assume the features have utility. However, it is often misleading to evaluate features based on classifier performance, as it varies according to the classifier used (Brown et al. 2012).

In this work, we employ the three primary considerations of Lyon et al. (2016) to evaluate our features. A feature must i) show importance in discriminating between the different classes/types of variable stars, ii) maximise the separation between the various variable stars, and iii) yield a good performance when used in conjunction with a classification algorithm. We have therefore applied two feature visualisation strategies to our features given in Table 1 before performing a multi-class classification.

5.1 Visualisation of feature space with one-dimensional density estimates

One way to visualise the features we extracted in §3 is to plot one-dimensional density estimates of the observed distributions, as shown in Fig. 4. This plot allows us to visually compare the distributions of the normalised features for the different classes of variable stars. The plots depict distinct feature-by-feature characteristics of each class. They enable us to identify outliers and determine if there is multi-modality in the feature space. For example, the RR Lyrae skew distributions are mostly narrow and peaked, showing that these classes are well-characterised by the skewness. In addition, the density plots provide us with information about which features are fundamentally important in discriminating different sets of classes. Some classes have overlapping distributions for one or more feature variables; this applies to the RR Lyrae, eclipsing binary and the Cepheid stars. Other classes, such as the LPVs, have trivially separable distributions, suggesting that the LPV class should be easy to classify. However, more rigorous investigation is needed to determine with confidence whether these features are useful for classification purposes.

    5.2 t-SNE

After visualising the individual features separately, we wish to understand the relationship between our features, but this is difficult to do given the feature dimensionality we are dealing with. It is, however, possible to overcome this problem by reducing the dimensionality so that it is representable in a 2- or 3-D representation space. t-Distributed Stochastic Neighbor Embedding (t-SNE; van der Maaten & Hinton 2008) is a tool for performing this reduction and visualisation (see footnote 3).



Figure 6. Normalised confusion matrix for multi-class classification with the RF classifier using the best hyper-parameters from a randomized grid search cross-validation. The RF classifier was applied on a 70% training set and evaluated on a 30% test set. The classifier performs poorly on under-represented class labels. Rows give the true label, columns the predicted label:

            RRab  RRc   RRd   Blazhko  Ecl   EA    Rot   LPV   d-Scuti  ACEP  Cep-II
RRab        0.92  0.00  0.00  0.02     0.01  0.01  0.02  0.00  0.00     0.01  0.00
RRc         0.00  0.77  0.06  0.00     0.12  0.00  0.04  0.00  0.00     0.00  0.00
RRd         0.05  0.47  0.33  0.01     0.11  0.00  0.03  0.00  0.00     0.01  0.00
Blazhko     0.75  0.00  0.00  0.21     0.02  0.00  0.02  0.00  0.00     0.00  0.00
Ecl         0.04  0.10  0.03  0.00     0.60  0.06  0.15  0.00  0.00     0.01  0.00
EA          0.03  0.00  0.00  0.00     0.03  0.90  0.02  0.00  0.00     0.01  0.00
Rot         0.04  0.08  0.02  0.00     0.13  0.02  0.65  0.02  0.00     0.01  0.02
LPV         0.00  0.00  0.00  0.00     0.00  0.00  0.02  0.98  0.00     0.00  0.01
d-Scuti     0.00  0.00  0.00  0.00     0.04  0.00  0.00  0.00  0.96     0.00  0.00
ACEP        0.13  0.00  0.00  0.00     0.00  0.04  0.00  0.00  0.00     0.74  0.09
Cep-II      0.00  0.00  0.00  0.00     0.04  0.07  0.20  0.02  0.00     0.15  0.52


More specifically, similar objects in high-dimensional space are clustered together with nearby points using a k-d tree (Bentley 1975) of all the points. Using a Student-t distribution with one degree of freedom (equivalent to the Cauchy distribution; Cauchy 1853), the Euclidean distance between each point and its k-nearest neighbours is computed. This distance is further converted into a probability distribution: similar points have a high probability of being assigned to the same class, and dissimilar points have a low probability of being picked. Afterwards, t-SNE constructs a similar probability distribution over the points in the low-dimensional space using a gradient descent method, thus minimizing the Kullback-Leibler divergence between the two probability distributions (Kullback & Leibler 1951).
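In practice the embedding is a one-liner with Scikit-Learn (footnote 3); the perplexity value below is an illustrative choice, as the paper does not quote its t-SNE settings.

```python
from sklearn.manifold import TSNE

# Map the normalised 7-feature space to two dimensions for plotting;
# emb[:, 0] and emb[:, 1] correspond to the two t-SNE features of Fig. 5.
emb = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X_norm)
```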

Generally, classes that are well separated in a t-SNE visualization yield good levels of classification performance from machine learning systems. However, the converse is not strictly true (Lochner et al. 2016). We use t-SNE only for the visualization of class separability, as there are various limitations with t-SNE: neither the sizes of the clusters nor the distances between the points may be informative (Wattenberg et al. 2016). For our high-dimensional visualisation, we partition the data into two categories: i) data with large

3 sklearn.manifold.TSNE: http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html

sample sizes, consisting of RRab, RRc, Ecl, EA, Rotational and LPV stars, and ii) data with small sample sizes, containing RRd, Blazhko, δ-Scuti, ACEP and Cep-II. Fig. 5(a) shows quite well-separated classes for the large sample sizes, and we would expect our ML algorithms to perform well using this data. Fig. 5(b) strongly suggests inseparability of the classes for the small sample data. This inseparability may be attributed to the fact that there are few examples of each class. Another possible explanation is that, for these stars, other features might be required to enable separability. After performing this feature visualisation, we decided to use all of the features as inputs to perform a multi-class classification.

5.3 Performance of Multi-class classification

We implement three classifiers, RF, DT and kNN, to perform an 11-class classification. Of these, RF achieves the best performance, with a balanced-accuracy rate of ∼70 per cent on the 30 per cent test data. This poor performance is not surprising given the large class imbalance present in the data, i.e. the classes with small sample sizes make up just ∼6 per cent of the whole dataset. The results presented here are mainly focused on the RF classifier, as this achieves the best balanced-accuracy. In Fig. 6, we plot the confusion matrix of the RF classifier, which depicts the predicted class versus the true class in tabular form.




[Figure 7 here: box and whisker plots of the median-centred features (μ, σ, skew, kurt, T, Amp, σ/μ) for Type 1: RRab and Type 6: EA.]

Figure 7. The box and whisker plots show how the features separate RRab (Type 1) and EA (Type 6) examples. The orange coloured boxes describe the feature distribution for RRab sources, where the corresponding black circles represent extreme outliers. The box plots in green describe the EA distributions. There is no clear separation of the period (T) between the two classes. The other features show varying degrees of improved separability, and are generally easily separable at a visual level between the two different types. Refer to Table 1 for definitions of the features used.

The results obtained by a perfect classifier would be values only along the diagonal of the confusion matrix. The off-diagonals give us information about the various types of errors that the classifier makes.

We note that some of the mistakes made by the classifier can be attributed to the fact that some of the science classes are physically similar, for example, the RR Lyrae, eclipsing binary and Cepheid subclasses. We also note that most of the misclassified examples arise from classes for which there are few training examples, indicating bias toward the larger classes and illustrating that class imbalance is an issue. Classes comprising many examples are more likely to be correctly classified, implying that having more examples of the under-represented science classes will help the classification.

    6 BINARY CLASSIFICATION

Given the complication of the multi-class classification scheme, decomposing the multi-class problem into smaller binary-class problems may alleviate this issue, reducing the higher complexity inherent to an 11-class problem. Therefore, in this section we consider binary cases for the classification scheme.

In addition, the poor performance of the multi-class classification in the previous section can be attributed not only to class imbalance, but also to the relative importance of the features. Therefore, to further investigate whether the features we used are important for classification, we perform an in-depth analysis of the features used for binary classification in §6.1, 6.2 & 6.3, using roughly balanced classes. Feature selection strategies can be grouped in various categories: classifier-independent (filter methods) and classifier-dependent (wrapper and embedded methods) are the most common. Whilst it is possible to evaluate feature importance using some classification models (e.g. using a random forest), we instead decouple feature analysis from any specific classifier model. We do this as wrapper methods are susceptible to overfitting, which can lead to feature choices that may not generalise well beyond the classifier used during selection (Brown et al. 2012). Thus in this paper we mostly used filter methods. These methods rank the features based on statistical measures of information content, determined by calculating the correlations/relationships between them.

Recall that in §5.2 we sub-divided the data set into two subsets, namely small and large samples, within which the classes have roughly equal numbers of examples. We therefore consider pair-wise combinations of each class within the small-sample dataset and perform feature selection and binary classification. This is repeated for the large-sample dataset. Here, we report on results for Type 1 (RRab) & Type 6 (EA) only for the in-depth feature analyses; similar performance is obtained for the other pair-wise combinations. For the performance of binary classification, we report the results obtained with Type 1 (RRab) & Type 6 (EA) from the large-sample dataset and Type 9 (δ-Scuti) & Type 10 (ACEP) from the small-sample dataset.

    6.1 Data visualisation for binary classes

The discriminating capabilities of the features are examined for some binary cases. To determine if there is some amount of separability between the two classes, we plot box and whisker plots for the different features extracted. The data has been pre-processed by performing the normalization described in §3. Here, we take Type 1 (RRab) as the negative class and Type 6 (EA) as the positive class. For each individual feature, the median value of the negative class is subtracted. This ensures a fair separability scale and centres the plots around the median, so a distinct separation is seen more clearly between the two classes. The box plots centred on the median value represent the feature distribution of the negative class.



Table 2. The point-biserial correlation coefficient r_pb and the Mutual Information MI(X; Y) for each feature, for the binary class pair RRab (Type 1) and EA (Type 6). Higher mutual information is desirable: a high MI value implies a strong correlation between the feature and the class labels. In addition, the Joint Mutual Information (JMI) rank is given for each feature; a lower JMI rank is preferred. The JMI ranking illustrates the importance of each feature after taking into account the redundancy between features.

Feature          r_pb     MI      JMI rank
Skew             0.648    0.503   2
Mean            −0.547    0.260   6
Std             −0.481    0.402   4
Kurtosis         0.248    0.486   3
Period, T        0.019    0.336   1
Amplitude       −0.375    0.307   5
Mean variance   −0.368    0.174   7

Fig. 7 shows a reasonable amount of separability between Type 1: RRab and Type 6: EA. For each individual feature, we note a clear separability between the two classes, except for the period, T. At this point we are tempted to write off this feature; however, for a more rigorous analysis of feature selection, we extend our investigation by looking for any linear correlation between the features and the class label.

    6.2 Point Biserial Correlation Test

The point-biserial correlation coefficient, $r_{pb}$ (Gupta 1960), is applied to find the relationship between a continuous variable, x, and a binary variable, y. Similarly to other correlation coefficients, it varies between −1 and +1, where +1 corresponds to a perfect positive relation and −1 corresponds to a perfect negative relation, while a value of 0 means there is no association at all. The point-biserial correlation coefficient is similar to the Pearson product moment (Pearson 1895). More detailed information can be found in Lyon et al. (2016).
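SciPy provides this test directly; a sketch, assuming a normalised feature matrix X_norm, binary labels y (0 = RRab, 1 = EA) and a feature_names list (all our own naming):

```python
from scipy import stats

# r_pb between each feature column and the binary class label;
# pointbiserialr is Pearson's r specialised to one binary variable.
for j, name in enumerate(feature_names):
    r_pb, p_value = stats.pointbiserialr(y, X_norm[:, j])
    print(f"{name}: r_pb = {r_pb:+.3f} (p = {p_value:.2g})")
```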

Table 2 illustrates the correlation between the seven features studied and the target class variable, for one binary class pair. It is observed that three features show strong correlations (|r_pb| > 0.45): the skew, the mean and the standard deviation. It can also be seen from Table 2 that the period T exhibits a weak correlation for this binary class pair. This means that T might not be useful for classification. However, again, one should be careful when judging feature importance based upon linear correlations: features having linear correlations close to zero may have useful non-linear correlations.

Table 3. A summary of the performance results of the various classifiers, ordered from best- to worst-performing based on the balanced-accuracy and G-mean for binary classification. Two values are given for all metrics: for each binary case, the first value is for Type 1 (RRab) or Type 9 (δ-Scuti), and the second for Type 6 (EA) or Type 10 (ACEP). The average of the two metrics is also given (∼).

Type 1 (RRab) and Type 6 (EA) classification
Classifier  Precision   Recall      F1-score    G-mean      Balanced accuracy
RF          0.97/0.97   0.97/0.97   0.97/0.97   0.97/0.97   0.94/0.94
            ∼0.97       ∼0.97       ∼0.97       ∼0.97       ∼0.94
DT          0.96/0.96   0.96/0.96   0.96/0.96   0.96/0.96   0.92/0.92
            ∼0.96       ∼0.96       ∼0.96       ∼0.96       ∼0.92
kNN         0.95/0.97   0.97/0.95   0.96/0.96   0.96/0.96   0.92/0.91
            ∼0.96       ∼0.96       ∼0.96       ∼0.96       ∼0.92

Type 9 (δ-Scuti) and Type 10 (ACEP) classification
RF          1.0/1.0     1.0/1.0     1.0/1.0     1.0/1.0     1.0/1.0
            ∼1.0        ∼1.0        ∼1.0        ∼1.0        ∼1.0
DT          1.00/0.96   0.96/1.00   0.98/0.98   0.98/0.98   0.95/0.96
            ∼0.98       ∼0.98       ∼0.98       ∼0.98       ∼0.96
kNN         0.98/0.98   0.96/0.98   0.97/0.97   0.97/0.97   0.93/0.94
            ∼0.97       ∼0.97       ∼0.97       ∼0.97       ∼0.93

    6.3 Information Theory

To investigate the relative importance of the features, we implement the Information Theoretic method that Lyon et al. (2016) applied to the pulsar search problem. According to Guyon & Elisseeff (2003), features that appear to be information poor may provide new and meaningful information when combined with one or more other features. Information theory is mainly based on the entropy of a feature, X. The Mutual Information (MI; Brown et al. 2012), which measures the amount of information between the input feature X and the true class label Y, is defined as:

$$I(X; Y) = H(X) - H(X \mid Y), \quad (2)$$

where $H(X)$ is the entropy and $H(X \mid Y)$ is the conditional entropy. The conditional entropy represents the amount of uncertainty remaining in X after knowing Y. Hence, the mutual information is indicative of the amount of uncertainty in X that is removed by knowing Y. MI is zero if and only if X and Y are statistically independent. Therefore, it is useful for features to have high MI. A high MI indicates that a feature is correlated with the target variable, and thus can in principle be used by a classifier to yield accurate classifications.
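MI against a class label can be estimated, for example, with Scikit-Learn; its estimator uses a nearest-neighbour entropy estimate, so the absolute values need not match Table 2 exactly.

```python
from sklearn.feature_selection import mutual_info_classif

# Estimate I(X; Y) of Eq. (2) for each feature column against the labels
mi = mutual_info_classif(X_norm, y, random_state=42)
for name, value in zip(feature_names, mi):
    print(f"{name}: MI = {value:.3f}")
```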

The MI values of the seven features are listed in Table 2. Using the MI method, we are able to choose the features that best describe the difference between two types of classes. However, the selected features might have redundant information; for example, two features that give high MI might carry the same information, in which case a single selected feature could be all that is required to achieve the same classification results.

    Therefore, a ranking system can be applied, which is



also known as the Joint Mutual Information (JMI; Yang & Fong 2011) criterion. The JMI enables the detection of complementary information, and thereby a reduction in the number of features by eliminating redundancy. The JMI performs a selection from a sample of feature sets based on the amount of complementary information and ranks them. It starts from the feature that possesses the largest MI value, $X_1$. A greedy iterative process is then used to decide which features complement $X_1$ (Guyon & Elisseeff 2003). The JMI score is given by

$$\mathrm{JMI}\left(X_J\right) = \sum_{X_K \in F} I\left(X_J X_K; Y\right), \quad (3)$$

where $X_J X_K$ denotes the joint variable formed from the two features and $F$ is the set of selected features. The iterative process continues until a set of features is selected or all features are ranked. The use of the JMI enables a reduction in the amount of redundancy within the features.
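A simplified greedy ranking under Eq. (3) can be written as below, discretising the normalised features into equal-width bins and scoring joint variables with Scikit-Learn's mutual_info_score; this is our sketch of the criterion, not the authors' implementation (which follows Brown et al. 2012).

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def jmi_rank(X, y, n_bins=10):
    """Greedy JMI ranking, Eq. (3): start from the feature with the largest
    MI, then repeatedly add the feature whose summed joint MI with the
    already-selected set, sum_K I(X_J X_K; Y), is largest."""
    # Equal-width discretisation; features are assumed scaled to [0, 1]
    Xd = np.floor(X * (n_bins - 1)).astype(int)
    d = X.shape[1]
    mi = [mutual_info_score(Xd[:, j], y) for j in range(d)]
    selected = [int(np.argmax(mi))]
    remaining = [j for j in range(d) if j != selected[0]]
    while remaining:
        def jmi(j):  # joint variable X_J X_K encoded as a single label
            return sum(mutual_info_score(Xd[:, j] * n_bins + Xd[:, k], y)
                       for k in selected)
        best = max(remaining, key=jmi)
        selected.append(best)
        remaining.remove(best)
    return selected  # feature indices, best-ranked first
```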

We have applied the JMI to our feature data, as shown in Table 2. We note that features which appear to be of low information content, as determined via visual analysis and study using the point-biserial correlation coefficient, now appear to be useful. It is tempting to write off features that show low-scoring linear correlations; however, it can be seen that even though the period, T, exhibits a low linear correlation, it is ranked as the best feature by the JMI. Given that we have shown that all seven features have utility, we apply them all for binary classification.

6.4 Performance of Binary Classification

For binary classification, we train three classifiers independently, using roughly balanced pair-wise class combinations from the large-sample and small-sample datasets. We present the results for i) Type 1: RRab & Type 6: EA and ii) Type 9: δ-Scuti & Type 10: ACEP only, to show the difference between utilizing the two different sized datasets. Similar results are obtained with other pair-wise combinations of binary classes, achieving balanced-accuracy and G-mean values that vary by ∼±0.03.

We found that the RF performs best, with an overall balanced-accuracy of 0.94 and a G-mean value of 0.97 for Type 1: RRab & Type 6: EA. In addition, we observe that the two science classes in the small dataset samples (Type 9: δ-Scuti and Type 10: ACEP) have some level of misclassification in multi-class classification, while in the case of binary classification, the RF (being the best performing algorithm) outputs a balanced-accuracy of 1.0 with a G-mean value of 1.0. The results for both cases studied are summarised in Table 3.

As discussed in §5.3, multi-class classification gave undesirable results. On the other hand, using binary classification (§6.4) we show how to obtain improved balanced-accuracies for the science classes, achieving balanced-accuracies > 90 per cent in all cases we investigated. We thus now attempt to build upon this result by following a hierarchical approach to classification, which allows us to break the problem into sets of binary comparisons or smaller multi-class comparisons.

[Figure 8 here: the classification hierarchy. Level 1 splits the data into eclipsing, rotational and pulsating systems; level 2 splits eclipsing into Ecl and EA, and pulsating into RR Lyrae, LPV, Cepheids and δ-Scuti; level 3 splits RR Lyrae into RRab, RRc, RRd and Blazhko, and Cepheids into ACEP and Cep-II.]

Figure 8. A hierarchy of variable-star classification for the data used in §2, constructed based on the understanding of their physical properties. At the top layer, the variable stars can be split into three major classes: eclipsing, rotational and pulsating systems. The number in parentheses represents the number of samples in each particular class.

7 HIERARCHICAL CLASSIFICATION - RESULTS AND DISCUSSION

Using some astrophysical properties of the variable stars in our dataset, we group the variable sources into categories as shown in Fig. 8. The first level of the hierarchy is split into three broad science classes: eclipsing, rotational and pulsating. From the top level, we continue to subdivide the classes; for instance, at the third level, we are left with exactly two main classes, RR Lyrae and Cepheids, with 6 science sub-classes in the final sub-nodes. For further details on hierarchical classification, we refer the reader to the excellent review by Silla & Freitas (2011).
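Operationally, the taxonomy can be realised as a cascade of independent classifiers, one per internal node of Fig. 8. A minimal two-level sketch follows (the third level repeats the same pattern; the data names and branch mapping are our assumptions, not the paper's code).

```python
from sklearn.ensemble import RandomForestClassifier

# Level 1: route each star to a super-class (eclipsing/rotational/pulsating);
# Level 2: a per-branch classifier refines the super-class prediction.
top = RandomForestClassifier(random_state=42).fit(X_train, y_super)
branches = {
    "eclipsing": RandomForestClassifier().fit(X_ecl, y_ecl),  # Ecl vs EA
    "pulsating": RandomForestClassifier().fit(X_pul, y_pul),  # RR Lyrae/LPV/Cepheids/d-Scuti
}

def predict_hierarchical(x):
    branch = top.predict(x.reshape(1, -1))[0]
    if branch in branches:               # the rotational node has no children
        return branch, branches[branch].predict(x.reshape(1, -1))[0]
    return branch, None
```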

Firstly, we perform a multi-class classification on the first hierarchy layer to distinguish between eclipsing, rotational and pulsating stars. This method helps to address some of the class imbalance issues. However, this separation is not perfect, because the rotational class has far fewer examples than the other two classes. The same evaluation methodology as in §4 is employed, and the results for the top layer are summarised in Table 4. We obtain a balanced-accuracy rate of ∼61 per cent with a G-mean value of ∼0.79 for the first layer.

We then move down the hierarchy to perform two separate classifications for layer 2, keeping the same examples in the training sets and the test sets as in the top layer. We subdivide the eclipsing binary group into the Ecl and EA classes and perform a binary classification. The result of this experiment shows that we are successful in classifying between the two classes of objects, with a 0.86 balanced-accuracy and a 0.93 G-mean value. Also in the second layer, we undertake a multi-class classification of four distinct classes: RR Lyrae, LPV, Cepheids and δ-Scuti. We achieve a balanced-accuracy of 0.98 in distinguishing between those classes.

Eventually, we follow the hierarchy down to the third layer, where we investigate the categorization of two classes: RR Lyrae and Cepheids. Our aim here is to find to what extent we will be successful in distinguishing between the sub-classes of RR Lyrae (RRab, RRc, RRd and Blazhko) and Cepheids (ACEP and Cep-II). The analysis shows that we are successful in distinguishing between RRab and RRc with high balanced-accuracy. However, most of the RRd sources



Figure 9. The normalized confusion matrices for the best classifier (RF) for hierarchical classification. Rows give the true label, columns the predicted label.

(a) First level: Eclipsing, Rotational, Pulsating
             Eclipsing  Rotational  Pulsating
Eclipsing    0.66       0.19        0.15
Rotational   0.11       0.74        0.15
Pulsating    0.07       0.06        0.87

(b) Second level: Ecl, EA
      Ecl   EA
Ecl   0.92  0.08
EA    0.06  0.94

(c) Second level: RR Lyrae, LPV, Cepheids, δ-Scuti
           RR Lyrae  LPV   Cepheids  δ-Scuti
RR Lyrae   0.99      0.00  0.01      0.00
LPV        0.00      0.99  0.01      0.00
Cepheids   0.05      0.00  0.95      0.00
δ-Scuti    0.00      0.00  0.00      1.00

(d) Third level: RRab, RRc, RRd, Blazhko
          RRab  RRc   RRd   Blazhko
RRab      0.98  0.00  0.00  0.02
RRc       0.00  0.90  0.10  0.00
RRd       0.06  0.50  0.44  0.01
Blazhko   0.81  0.00  0.00  0.19

(e) Third level: ACEP, Cep-II
         ACEP  Cep-II
ACEP     0.96  0.04
Cep-II   0.15  0.85



Table 4. A summary of the performance results of the RF classifier for the variable-star hierarchical classification. We report metrics per class, separated by '/'. We also compute the average metrics over all classes, summarised in a single value (∼).

Precision             Recall                F1-score              G-mean                Balanced accuracy

First level: Eclipsing, Rotational and Pulsating classification
0.97/0.19/0.51        0.66/0.74/0.87        0.79/0.30/0.64        0.78/0.78/0.86        0.59/0.60/0.75
∼0.86                 ∼0.70                 ∼0.74                 ∼0.79                 ∼0.61

Second level: RR Lyrae, LPV, Cepheids and δ-Scuti
1.00/1.00/0.75/0.96   0.99/0.99/0.95/1.00   0.99/1.00/0.84/0.98   0.99/1.00/0.97/1.00   0.98/0.99/0.93/1.00
∼0.99                 ∼0.99                 ∼0.99                 ∼0.99                 ∼0.98

Second level: Ecl and EA
0.90/0.50             0.92/0.94             0.95/0.65             0.93/0.93             0.86/0.86
∼0.95                 ∼0.92                 ∼0.93                 ∼0.93                 ∼0.86

Third level: RRab, RRc, RRd and Blazhko
0.96/0.93/0.36/0.29   0.98/0.90/0.44/0.19   0.97/0.91/0.40/0.23   0.97/0.92/0.65/0.44   0.94/0.85/0.40/0.18
∼0.90                 ∼0.90                 ∼0.90                 ∼0.92                 ∼0.86

Third level: ACEP and Cep-II
0.86/0.95             0.96/0.85             0.91/0.90             0.90/0.90             0.82/0.80
∼0.91                 ∼0.90                 ∼0.90                 ∼0.90                 ∼0.81

are classified as RRc, and Blazhko sources are classified as RRab. To check whether our results are affected by class imbalance, we downsample RRc in the training set to the same number of samples as RRd and perform a binary classification. This process is repeated for RRab, whereby the number of objects is decreased to the same number as in the Blazhko class. We train a binary classifier, keeping the test set the same for RRc & RRd and RRab & Blazhko. We found that using balanced classes does increase the performance of the classifier significantly; for instance, we are able to distinguish between RRab & Blazhko and RRc & RRd with balanced-accuracy rates of 80 per cent and 75 per cent respectively.

For an in-depth analysis of the RR Lyrae sub-classification, we use the data provided by Drake et al. (2017), who utilised the Adaptive Fourier Decomposition (AFD) method (Torrealba et al. 2015) to determine the period of each source. To see how the visually selected periodic variables differ from the initial candidates, we plot the amplitude and the period distribution of the variable stars available in the catalog. From Fig. 10, we note that it is a challenging task to separate the subclasses RRab & Blazhko and RRc & RRd; we found that our ML classifier also struggles to separate those classes. In addition, we observe a clear separation between RRab and RRc, and our classification pipeline is also able to separate these two classes. In the same vein, Malz et al. (2018) point out that it is generally challenging to have a clear separation between RRc and RRd even though they have different pulsation modes, and due to the rarity of RRd sources, they are often subsumed by RRc labels. Moreover, Drake et al. (2017) argued that RRd phased light curves often appear like those of RRc, and Blazhko stars often resemble RRab's when phase-folded over multiple cycles.

Furthermore, we analyse the classification of Cepheids and obtain a misclassification rate of ∼19 per cent. On the same note, Drake et al. (2017) argued that classifying ACEP and BL Her (a sub-group of Cep-II) is difficult for field stars because of the mixture of stellar populations in the field and the inadequate distance information.

For our hierarchical model, we present the RF classifier results only, as it performs well compared to the other classifiers discussed in this paper. From the results in Table 4 and Fig. 9, we immediately see a disparity in performance among the classes. This discrepancy in misclassification is due to the comparative size of each of the science classes: we obtain high performance with classes that are data-rich. To alleviate the class-imbalance problem, one can either gather different catalogs for the under-sampled classes or augment the existing available catalogs via sophisticated statistical machine learning approaches.

In Fig. 11, we plot the precision-recall curve for each class, and we note that the classification performance is very good. The area under the precision-recall curve is greater than 0.85 for several classes, except for Type 7: Rotational, Type 3: RRd and Type 4: Blazhko. This correlates with the results presented in Table 4. In addition, we compare our proposed hierarchical approach to an already



[Figure 10 appears here. Panel (a), "The period-amplitude distribution": amplitude, A_V,CSS (0.0–3.0), against period (0.2–1.8 days) for 1. RRab, 2. RRc, 3. RRd and 4. Blazhko. Panel (b), "The period distribution": number of samples (10^0 to 10^3, log scale) against period (days) for the same four sub-types.]

Figure 10. We plot the distribution of RR Lyrae sub-types (RRab, RRc, RRd and Blazhko) using available data from the ascii catalog of the sources in Drake et al. (2017).

In addition, we compare our proposed hierarchical approach to an already implemented machine learning package known as UPSILON (Kim & Bailer-Jones 2016). The latter uses a RF classifier trained on 16 features extracted from OGLE (Udalski et al. 1997) and EROS-2 (Tisserand et al. 2007) periodic variable star light curves, while our model is trained on 7 features from CRTS data. The comparison is made with only 8 classes out of 11, as UPSILON has not been trained on all the variable stars available in the CRTS data. We report the results in terms of recall and F1-score in Table 5. Our hierarchical model outperforms the UPSILON model in classifying most of the variable stars, the exception being Type 6: EA. We can plausibly say that our hierarchical classifier yields good performance with 7 features in classifying sub-groups of stars compared to the UPSILON package. For a comparison in terms of features used, we report results when using a 'flat' multi-class model (RF) in §5.3 with 7 features and compare it with the hierarchical model. The hierarchical model outperforms both the UPSILON model and the multi-class model developed in this paper. This clearly shows that it is not necessary to extract many features to obtain higher classification metrics. In addition, we have shown that converting a 'flat' multi-class problem into a hierarchical system improves variable star classification.
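The per-class scores reported in Table 5 can be computed from any model's predictions in the same way; a minimal sketch, assuming y_test and y_pred hold the true and predicted labels of the 8 common classes:

from sklearn.metrics import recall_score, f1_score

labels = sorted(set(y_test))
recalls = recall_score(y_test, y_pred, labels=labels, average=None)
f1s = f1_score(y_test, y_pred, labels=labels, average=None)
for name, r, f in zip(labels, recalls, f1s):
    print(f"{name:>14}  recall={r:.2f}  F1={f:.2f}")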

    8 CONCLUSION AND FUTURE WORK

With the upcoming synoptic surveys (e.g. LSST; Ivezic et al. 2008), automated transient/variable-star classification is becoming an increasingly important field in astronomy. It presents a difficult computational challenge which we have nonetheless attempted to tackle in this work. We described the core challenges associated with automatic variable-star classification and subsequently explored the nature of the data to be processed. We then applied various approaches with the aim of accurately classifying 11 types of variable star. Some of the methods we applied are similar to current techniques, while others are new to this field. Compared to other related work, we utilized only 7 input features during classification, where 6 features are based on statistical properties of the unfolded data and the Period, T, is obtained directly from the data catalog. We used these features as inputs to three separate, commonly employed ML algorithms.

We demonstrated that the RF classification algorithm performed best in all test cases. Treating variable-star classification as a multi-class problem with 11 classes results in poor performance. In this setting, the RF algorithm is efficient at identifying the RRab, RRc, Ecl, EA, Rotational and LPV classes, but is unsuccessful in distinguishing, for example, δ-Scuti and ACEP. However, by decomposing the multi-class problem into several binary classification problems, we gain a significant improvement in balanced-accuracy, reaching a balanced-accuracy rate of 1.0 for the classification of δ-Scuti and ACEP. Our results suggest that decomposing a multi-class problem into several binary classification steps could yield improved results.

We therefore developed a hierarchical approach for classifying the 11 variable-star classes in our data, by aggregating those classes that show similarities. We show that a hierarchical taxonomy for n-class objects improves the classification rate. We are now successful in identifying sub-types of cepheids and eclipsing binaries, with balanced-accuracy rates of 81 per cent and 86 per cent respectively. Whilst the hierarchical approach does not work well for all classes, this is understandable in cases with high class-imbalance and a lack of training examples.
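The mechanics of such a taxonomy can be sketched as a tree of per-node classifiers. The two-level structure and the names below (y_super_train, predict_hierarchical) are hypothetical and simpler than the taxonomy developed in this paper; they only illustrate how objects are routed from coarse super-classes down to sub-types.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Level 1: coarse super-classes (e.g. "RRLyrae", "Cepheid", "Eclipsing").
top = RandomForestClassifier(n_estimators=100, random_state=42)
top.fit(X_train, y_super_train)

# Level 2: one specialist per super-class, trained only on its members.
specialists = {}
for group in np.unique(y_super_train):
    mask = y_super_train == group
    sub = RandomForestClassifier(n_estimators=100, random_state=42)
    sub.fit(X_train[mask], y_train[mask])  # fine labels within the group
    specialists[group] = sub

def predict_hierarchical(x):
    # Route a single feature vector down the tree: the coarse node picks
    # a branch, then the matching specialist assigns the final sub-type.
    group = top.predict(x.reshape(1, -1))[0]
    return specialists[group].predict(x.reshape(1, -1))[0]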

We have presented an approach to analysing variable star data in a way that is beneficial to machine learning feature analysis and extraction. This process yields insights that allow us to obtain improved classification performance. In other words, we have taken a principled approach to feature analysis and design, especially for multi-class problems with highly imbalanced datasets. We employ new methods for feature selection and evaluation, and we show that converting a multi-class problem to a hierarchical classification scheme helps to reduce the class-imbalance problem as well as providing a significant improvement for variable star classification. In the near future, we wish to investigate the impact of data augmentation on the performance of our ML classifiers. Moreover, flagging data (as we did for the "Miscellaneous" class) is often undesirable. One school of thought is to treat such objects as outliers, but another might argue that they could perhaps indicate new discoveries. Handling them is a major task with such a classification scheme, since we now have to tackle the problem in an unsupervised way, a topic currently in its infancy.



Table 5. Comparison of our hierarchical model, our multi-class model and the UPSILON package (Kim & Bailer-Jones 2016). For a fair comparison, we test these models on 8 classes and report the scores in terms of recall and F1-score. The model with the higher classification score is highlighted in blue.

Types of Variable Stars    UPSILON Model         Our Multi-Class Model    Our Hierarchical Model
                           with 16 Features      with 7 Features          with 7 Features
                           Recall   F1-Score     Recall   F1-Score        Recall   F1-Score
RRab                       0.75     0.86         0.92     0.73            0.98     0.97
RRc                        0.78     0.87         0.77     0.46            0.90     0.91
RRd                        0.11     0.20         0.33     0.14            0.44     0.40
Ecl (EC & ESD)             0.75     0.73         0.60     0.74            0.92     0.95
EA                         0.97     0.98         0.90     0.68            0.94     0.65
LPV                        0.92     0.96         0.98     0.95            0.99     1.00
δ-Scuti                    0.86     0.93         0.96     0.70            1.00     0.98
Cep-II                     0.38     0.55         0.52     0.32            0.85     0.90

Figure 11. Precision-recall curves for each node in the hierarchical model. Each curve represents a different variable-star class, with the area under the precision-recall curve given in brackets. This metric is computed on the 30% of the dataset used for testing, except that the Type 5: Ecl class has ≈15 647 samples.


    ACKNOWLEDGEMENTS

We thank the referee for useful comments and suggestions in improving this manuscript. ZH acknowledges support from the UK Newton Fund as part of the Development in Africa with Radio Astronomy (DARA) Big Data project delivered via the Science & Technology Facilities Council (STFC). BWS acknowledges funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 694745). AM is supported by the Imperial President's PhD Scholarship.

    APPENDIX A: EVALUATION METRICS

The performance of any machine learning algorithm is evaluated using measures such as balanced-accuracy, precision, recall, F1-score, sensitivity and specificity (Chao et al. 2004). They are defined in terms of the True Positive (TP) rate, False Positive (FP) rate, True Negative (TN) rate and False Negative (FN) rate. The sensitivity metric is defined as the true positive rate or positive class accuracy, while specificity is referred to as the true negative rate or, equivalently, the negative class accuracy (Danjuma 2015).

• Sensitivity Measure: Equation A1, also known as the true positive rate or "recall". It measures the proportion of actual positives correctly identified by the model.

Sensitivity/Recall = TP / (TP + FN)    (A1)

• Specificity Measure: Equation A2, also known as the true negative rate. It measures how well a model identifies negative results.

Specificity = TN / (TN + FP)    (A2)

• Precision: A measure of the retrieved instances that are correctly labelled, described in Equation A3.

Precision = TP / (TP + FP)    (A3)

• F1-score: A metric that aims to quantify overall performance, expressed in terms of precision and recall as shown in Equation A4.

F1-Score = 2 × Precision × Recall / (Precision + Recall)    (A4)

• Balanced accuracy measure: Defined as the average of the recall obtained on each class, making it a metric suited to imbalanced classes.



Table A1. Confusion matrix implemented for our specific problem, where the positive class corresponds to Class I and the negative class to Class II for the two classification schemes. True/false positives/negatives are represented as TP, FP, TN and FN respectively.

                          Predicted Class
                          Class II      Class I
Actual      Class II      TN            FP
Class       Class I       FN            TP

Balanced Accuracy = (Sensitivity + Specificity) / 2 × 100%    (A5)

• Geometric Mean Score: The G-Mean is defined as the square root of the product of the class-wise sensitivities. It maximizes the accuracy on each class while keeping these accuracies balanced.

G-Mean = √(Sensitivity × Specificity)    (A6)

• Precision-Recall curve: The PR curve illustrates the trade-off between the true positive rate (recall) and the positive predictive value (precision) for a model at different probability thresholds.

• Confusion Matrices: TP, FP, TN and FN can be visualized by a confusion matrix, as illustrated in Table A1, where the predicted class is indicated in each column and the actual class in each row. In this case, from Table A1, the true positives (TP) are Class I examples that were correctly classified as Class I, and the false positives (FP) correspond to Class II examples wrongly classified as Class I. The false negatives (FN) and true negatives (TN) are defined in a similar way. For binary classification problems, Class I corresponds to the positive class and Class II to the negative class. After the training and testing process, we evaluate our pipeline using balanced-accuracy, G-mean, F1-score, recall values and confusion matrices; all of these quantities can be derived directly from the confusion matrix, as sketched below.
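For a binary problem, every quantity in Equations A1–A6 follows directly from the four entries of Table A1. A minimal sketch using scikit-learn's confusion_matrix (y_test and y_pred are assumed binary label arrays, with Class II encoded as 0 and Class I as 1):

import numpy as np
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
sensitivity  = tp / (tp + fn)                      # Eq. (A1)
specificity  = tn / (tn + fp)                      # Eq. (A2)
precision    = tp / (tp + fp)                      # Eq. (A3)
f1           = 2 * precision * sensitivity / (precision + sensitivity)  # Eq. (A4)
balanced_acc = 0.5 * (sensitivity + specificity)   # Eq. (A5)
g_mean       = np.sqrt(sensitivity * specificity)  # Eq. (A6)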

    REFERENCES

Bates S., Bailes M., Bhat N., Burgay M., et al., 2011, MNRAS, 416, 2455
Belokurov V., Evans N. W., Le Du Y., 2003, MNRAS, 341, 1373
Bentley J. L., 1975, Communications of the ACM, 18, 509
Bergstra J., Yamins D., Cox D. D., 2013, in Proceedings of the 12th Python in Science Conference, pp 13-20
Blazhko S., 1907, Astronomische Nachrichten, 175, 325
Breiman L., 2001, Machine Learning, 45, 5
Brown G., Pocock A., Zhao M.-J., Luján M., 2012, Journal of Machine Learning Research, 13, 27
Buturovic L. J., 1993, Pattern Recognition, 26, 611
Catelan M., Smith H. A., 2015, Wiley-VCH
Cauchy A., 1853, C.R. Acad. Sci., 37, 198
Chao C., Liaw A., Breiman L., 2004, University of California, Berkeley
Danjuma K. J., 2015, CoRR, abs/1504.04646
Dietterich T. G., 2000, Multiple Classifier Systems, 1857, 1
Djorgovski S. G., Donalek C., Mahabal A., et al., 2011, preprint
Djorgovski S. G., Graham M. J., Donalek C., et al., 2016, preprint
Drake A. J., Djorgovski S. G., Mahabal A., et al., 2009, ApJ, 696, 870
Drake A. J., Djorgovski S. G., Catelan M., Graham M. J., et al., 2017, MNRAS, 469, 3688
Eyer L., Blake C., 2005, MNRAS
Gregory P. C., Loredo T. J., 1992, ApJ
Gupta S., 1960, Psychometrika, 25, 393
Guyon I., Elisseeff A., 2003, Journal of Machine Learning Research, 3, 1157
He H., Garcia E. A., 2008, IEEE Transactions on Knowledge & Data Engineering, pp 1263-1284
Ivezic Z., Kahn S. M., Tyson J. A., et al., 2008, American Astronomical Society
Juric M., Kantor J., Lim K. T., Lupton R. H., et al., 2015, Astronomical Data Analysis Software and Systems, 512
Kim D.-W., Bailer-Jones C. A., 2016, A&A, 587, A18
Kullback S., Leibler R. A., 1951, The Annals of Mathematical Statistics, 22, 79
Last F., Douzas G., Bacao F., 2017, arXiv:1711.00837
Lochner M., McEwen J. D., Peiris H. V., et al., 2016, ApJS, 225, 14
Lomb N. R., 1976, Astrophysics and Space Science, 39, 447
Lyon R. J., Stappers B. W., Cooper S., Brooke J. M., Knowles J. D., 2016, MNRAS, 459, 1104
Mahabal A. A., Donalek C., Djorgovski S. G., et al., 2012, New Horizons in Time Domain Astronomy, 285, 355
Mahabal A., Sheth K., Gieseke F., et al., 2017, IEEE Symposium Series on Computational Intelligence, p. 2757
Malz A., Hlozek R., Allam T. J., Bahmanyar A., et al., 2018, arXiv:1809.11145
Narayan G., Zaidi T., Soraisam M. D., et al., 2018, ApJS, 236
Nun I., Protopapas P., Sim B., et al., 2015, arXiv:1506.00010
Pearson K., 1895, Proceedings of the Royal Society of London, 58, 240
Pedregosa F., et al., 2011, Journal of Machine Learning Research, 12, 2825
Quinlan J. R., 1986, Machine Learning, 1, 81
Revsbech E. A., Trotta R., van Dyk D. A., 2018, MNRAS, 473, 3969
Richards J. W., et al., 2011, ApJ, 733, 10
Saha A., Vivas A. K., 2017, AJ, 154
Scargle J. D., 1982, ApJ, 263, 835
Silla C. N. J., Freitas A. A., 2011, Data Mining and Knowledge Discovery, 22, 31
Tisserand P., et al., 2007, A&A, 469, 387
Torrealba G., Catelan M., Drake A. J., Djorgovski S. G., et al., 2015, MNRAS, 446, 2251
Udalski A., Kubiak M., Szymanski M., 1997, Acta Astron., 47, 319
Wattenberg M., Fernanda V., Johnson I., 2016, Distill
Willemsen P., Eyer L., 2007, arXiv:0712.2898v1
Yang H., Fong S., 2011, in Proceedings of the 13th International Conference on Data Warehousing and Knowledge Discovery, pp 471-483
Zhang H., 2004, FLAIRS, 2
van der Maaten L., Hinton G., 2008, Journal of Machine Learning Research, pp 2579-2605

This paper has been typeset from a TeX/LaTeX file prepared by the author.


