Rattle Graphical Interface for R Language
Uploaded by majid-abdollahi
05/02/2023 IAUSHIRAZ 1
INTRODUCTION TO R AND RATTLE
What is R
Statistical programming language
Widely used among statisticians and data miners for developing statistical software and for data analysis.
Free and Open Source
Written in C, Fortran and R
Statistical features
  Linear and nonlinear modeling
  Statistical tests
  Classification, clustering
R objects can be manipulated from C, C++, Java, .NET or Python code.
Source Example
> x <- c(1,2,3,4,5,6)   # Create ordered collection (vector)
> y <- x^2              # Square the elements of x
> print(y)              # Print (vector) y
[1]  1  4  9 16 25 36
> mean(y)               # Calculate average (arithmetic mean) of (vector) y; result is scalar
[1] 15.16667
> var(y)                # Calculate sample variance
[1] 178.9667
> lm_1 <- lm(y ~ x)     # Fit a linear regression model "y = f(x)" or "y = B0 + (B1 * x)"
                        # and store the results as lm_1
> print(lm_1)           # Print the model from the (linear model object) lm_1

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x
     -9.333        7.000

> summary(lm_1)         # Compute and print statistics for the fit
                        # of the (linear model object) lm_1

Call:
lm(formula = y ~ x)

Residuals:
      1       2       3       4       5       6
 3.3333 -0.6667 -2.6667 -2.6667 -0.6667  3.3333

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -9.3333     2.8441  -3.282 0.030453 *
x             7.0000     0.7303   9.585 0.000662 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.055 on 4 degrees of freedom
Multiple R-squared:  0.9583,  Adjusted R-squared:  0.9478
F-statistic: 91.88 on 1 and 4 DF,  p-value: 0.000662

> par(mfrow=c(2, 2))    # Request 2x2 plot layout
> plot(lm_1)            # Diagnostic plot of regression model
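As a cross-check on the coefficients printed above, the same intercept and slope can be recovered from the closed-form least-squares formulas. This is only a sketch; the names slope, intercept and fit are introduced here for illustration and do not appear in the original slide.

```r
# Same data as in the source example
x <- c(1, 2, 3, 4, 5, 6)
y <- x^2

# Closed-form simple linear regression:
#   B1 = cov(x, y) / var(x),  B0 = mean(y) - B1 * mean(x)
slope <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)

# Compare with the coefficients estimated by lm()
fit <- lm(y ~ x)
print(c(intercept = intercept, slope = slope))
print(coef(fit))
```

Both print calls should report the same two numbers, matching the (Intercept) and x rows of the summary.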
Graphical front-ends
Architect – cross-platform open source IDE based on Eclipse and StatET.
DataJoy – online R editor focused on beginners to data science and collaboration.
Deducer – GUI for menu-driven data analysis (similar to SPSS/JMP/Minitab).
Java GUI for R – cross-platform stand-alone R terminal and editor based on Java (also known as JGR).
Number Analytics – GUI for R-based business analytics (similar to SPSS), working on the cloud.
Rattle GUI – cross-platform GUI based on RGtk2 and specifically designed for data mining.
R Commander – cross-platform menu-driven GUI based on tcltk (several plug-ins to Rcmdr are also available).
Revolution R Productivity Environment (RPE) – Visual Studio-based IDE provided by Revolution Analytics, with plans for a web-based point-and-click interface.
RGUI – comes with the pre-compiled version of R for Microsoft Windows.
RKWard – extensible GUI and IDE for R.
RStudio – cross-platform open source IDE (which can also be run on a remote Linux server).
What is Rattle
R graphical user interface package
Developed by Graham Williams of Togaware Pty Ltd.
Free and Open Source
Presents statistical and visual summaries of data
Tabs: Load Data, Data Exploration, Model, Evaluation, Test, …
Rattle Installation Process
Download and install R
  https://r-project.org
  About 60MB
Download the Rattle package
  About 300MB
  Follow the instructions (R is case-sensitive, so the function names are lowercase):
    install.packages("rattle", dependencies=c("Depends", "Suggests"))
    library(rattle)
    rattle()
Load Data
Dataset types:
CSV File (CSV, TXT, Excel)
ARFF (CSV file which adds type information)
ODBC (MySQL, SQLite, SQL Server, …)
  Set connections in: /etc/odbcinst.ini and /etc/odbc.ini
R Dataset (existing datasets in the current R session)
R Data File
Library (pre-existing datasets)
Corpus (collection of documents)
Script (scripts for generating datasets)
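For comparison with the CSV File option, the following sketch loads a small CSV at the R console using base R only. The file contents and the column names (ID, age, income, TARGET_buy) are invented for illustration, and this is not necessarily the exact call Rattle issues internally.

```r
# Write a tiny CSV to a temporary file so the example is self-contained.
# The columns and values are made up for illustration.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("ID,age,income,TARGET_buy",
             "1,34,52000,yes",
             "2,45,,no",
             "3,29,61000,yes"),
           csv_path)

# Read the file; empty fields and the string "NA" become missing values,
# and text columns are converted to factors (categorical variables).
ds <- read.csv(csv_path, na.strings = c("", "NA"), stringsAsFactors = TRUE)

str(ds)  # Show the variables and their types
```

A data frame created this way is also the kind of object the R Dataset option picks up from the current session.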
Load Data
Variable types:
Input (most variables are inputs)
  Used to predict the target variable
Target (influenced by the input variables)
  Known as the output
  Prefix: TARGET_
Risk (a measure of the size of the targets)
  Prefix: RISK_
Identifier (any numeric variable that has a unique value for each observation; not normally used in modeling)
  Such as: ID, Date
  Prefix: ID_
Ignore (excluded from modeling)
  Prefix: IGNORE_
Weight (observations weighted by an R formula)
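The prefix convention above can be illustrated with a small helper that guesses a role from a variable name. guess_role is a hypothetical function written for this sketch, not part of Rattle, and the variable names are invented.

```r
# Hypothetical helper: map variable names to roles using the prefixes
# listed above. Anything without a known prefix is treated as an input.
guess_role <- function(var_names) {
  role <- rep("input", length(var_names))
  role[grepl("^TARGET_", var_names)] <- "target"
  role[grepl("^RISK_",   var_names)] <- "risk"
  role[grepl("^ID_",     var_names)] <- "ident"
  role[grepl("^IGNORE_", var_names)] <- "ignore"
  setNames(role, var_names)
}

roles <- guess_role(c("ID_customer", "age", "income",
                      "IGNORE_notes", "RISK_amount", "TARGET_buy"))
print(roles)
```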
Transform
Rescale
  Normalize: Recenter, Scale [0-1], Median/MAD, Natural Log, Log 10, Matrix
  Order: Rank, Interval (number of groups)
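A rough base-R sketch of what the rescaling options compute is given below. The formulas are the usual textbook definitions and are assumed here rather than taken from Rattle's source; the equal-width binning used for Interval is one possible reading of that option.

```r
x <- c(2, 4, 4, 7, 10, 15)  # Example numeric variable (made up)

# Normalize
recentered <- (x - mean(x)) / sd(x)                # Recenter: mean 0, sd 1
scaled01   <- (x - min(x)) / (max(x) - min(x))     # Scale [0-1]
robust     <- (x - median(x)) / mad(x)             # Median/MAD
log_nat    <- log(x)                               # Natural Log
log_10     <- log10(x)                             # Log 10

# Order
ranked <- rank(x)                                  # Rank (ties share the average rank)
binned <- cut(x, breaks = 3, labels = FALSE)       # Interval: 3 equal-width groups (assumption)

print(data.frame(x, recentered, scaled01, robust, ranked, binned))
```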
Transform
Impute (missing values)
  Zero, Mean, Median, Mode, Constant
Recode
  Quantiles
  K-Means
  Equal Width
  Indicator Variable / Join Categories
  As Categorical / As Numeric
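The imputation and recoding options can be approximated at the console as follows. This is a sketch in base R with invented data (age, colour); Rattle's own generated code may differ.

```r
age <- c(23, NA, 35, 41, NA, 58, 62, 30)  # Made-up variable with missing values

# Impute: replace missing values by the mean or the median of the observed values
age_mean   <- ifelse(is.na(age), mean(age, na.rm = TRUE), age)
age_median <- ifelse(is.na(age), median(age, na.rm = TRUE), age)

# Recode, Quantiles: bin into four groups with roughly equal counts
q_breaks     <- quantile(age_median, probs = seq(0, 1, by = 0.25))
age_quartile <- cut(age_median, breaks = q_breaks, include.lowest = TRUE)

# Recode, Equal Width: bin into four intervals of equal width
age_width <- cut(age_median, breaks = 4)

# Recode, Indicator Variable: one 0/1 column per category
colour     <- factor(c("red", "green", "red", "blue"))
indicators <- model.matrix(~ colour - 1)

# As Numeric / As Categorical
colour_num <- as.numeric(colour)
age_cat    <- as.factor(age_median)

print(table(age_quartile))
print(indicators)
```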
Transform
Cleanup
  Delete Ignored
  Delete Selected
  Delete Missing
  Delete Observations with Missing
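The cleanup operations correspond to simple data frame subsetting. The sketch below assumes that Delete Missing drops variables containing missing values while Delete Observations with Missing drops rows; the data frame and the ignored variable are invented for illustration.

```r
ds <- data.frame(id     = 1:4,
                 age    = c(34, NA, 29, 51),
                 notes  = c("a", NA, NA, NA),
                 income = c(52, 48, NA, 61))

ignored <- c("notes")  # Variables marked as Ignore (assumed for this example)

# Delete Ignored: drop the ignored variables
no_ignored <- ds[, !(names(ds) %in% ignored), drop = FALSE]

# Delete Missing (assumed meaning): drop every variable that has any missing value
no_missing_vars <- ds[, colSums(is.na(ds)) == 0, drop = FALSE]

# Delete Observations with Missing: keep only complete rows
complete_rows <- na.omit(no_ignored)

print(names(no_ignored))
print(names(no_missing_vars))
print(complete_rows)
```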
Exploration
Summary
  Summary: min, max, mean and quartile values.
  Describe: missing, unique, sum, mean, lowest and highest values.
  Basics (for numeric values): measures of numeric data (missing, min, max, quartiles, mean, sum, skewness, kurtosis).
  Kurtosis (for numeric values): a larger value indicates a sharper peak; a lower value indicates a flatter peak.
  Skewness (for numeric values): a positive skew indicates that the right tail is longer; a negative skew indicates that the left tail is longer.
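To make the skewness and kurtosis descriptions concrete, here is a minimal moment-based sketch. The functions skewness and kurtosis are defined locally for illustration (kurtosis is reported as excess kurtosis, an assumption); Rattle relies on add-on packages whose estimators may differ slightly.

```r
# Moment-based skewness: third central moment over variance^(3/2)
skewness <- function(v) {
  v <- v[!is.na(v)]
  m <- mean(v)
  mean((v - m)^3) / mean((v - m)^2)^1.5
}

# Moment-based excess kurtosis: fourth central moment over variance^2, minus 3
kurtosis <- function(v) {
  v <- v[!is.na(v)]
  m <- mean(v)
  mean((v - m)^4) / mean((v - m)^2)^2 - 3
}

right_tailed <- c(1, 1, 2, 2, 3, 10)  # Long tail to the right
symmetric    <- c(1, 2, 3, 4, 5)

print(summary(right_tailed))          # Min, quartiles, mean, max
print(skewness(right_tailed))         # Positive: right tail is longer
print(skewness(-right_tailed))        # Negative: left tail is longer
print(skewness(symmetric))            # Zero for symmetric data
print(kurtosis(symmetric))
```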
Exploration
Summary
  Show Missing: each row corresponds to a pattern of missing values, which can help in understanding why the data is missing. Rows and columns are sorted in ascending order of missing data.
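A missing-value pattern table of this kind can be sketched in base R as below. The data frame is invented, and the encoding (1 = observed, 0 = missing) is a choice made for this example.

```r
ds <- data.frame(age    = c(34, NA, 29, NA, 51),
                 income = c(52, 48, NA, NA, 61),
                 gender = c("f", "m", "m", "f", "f"))

# One string per observation: 1 = value observed, 0 = value missing,
# in the column order age, income, gender
pattern <- apply(!is.na(ds), 1, function(r) paste(as.integer(r), collapse = ""))

# How many observations share each pattern
pattern_counts <- sort(table(pattern), decreasing = TRUE)

# Number of missing values per variable, in ascending order
missing_per_var <- sort(colSums(is.na(ds)))

print(pattern_counts)
print(missing_per_var)
```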
Exploration
Distributions (review the distribution of each variable in the dataset)
Annotate (include numeric values in plots)
Group By
Numeric outputs:
  Box Plot, Histogram, Cumulative, Benford
    For any number of continuous variables
  Pairs
Categorical outputs:
  Bar Plot, Dot Plot, Mosaic, Pairs
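The Benford output compares the observed frequency of leading digits with Benford's law, P(d) = log10(1 + 1/d). The sketch below computes both sides for an example series (powers of 2, chosen here only because they are known to follow the law closely); first_digit is a helper defined for this illustration.

```r
# Leading non-zero digit of each value, read from scientific notation
first_digit <- function(v) {
  v <- abs(v[!is.na(v) & v != 0])
  as.integer(substr(formatC(v, format = "e", digits = 15), 1, 1))
}

values   <- 2^(0:99)                       # Example data (assumption)
digits   <- first_digit(values)
observed <- as.numeric(table(factor(digits, levels = 1:9))) / length(digits)
expected <- log10(1 + 1 / (1:9))           # Benford's law

print(round(data.frame(digit = 1:9, observed, expected), 3))
```

Large gaps between the observed and expected columns suggest that a variable does not follow Benford's law.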
Exploration
Correlations (Rattle only computes correlations between numeric variables at this time)
  Ordered: order by strength of correlation
  Explore Missing: correlation between missing values
  Hierarchical
  Pearson, Kendall, Spearman
Principal Components
  SVD (numeric variables only)
  Eigen
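The correlation and principal component options map onto standard functions in R's stats package. The sketch below uses the built-in iris data; pairing SVD with prcomp and Eigen with princomp is an assumption about how the two options are implemented.

```r
num <- iris[, sapply(iris, is.numeric)]    # Numeric variables only

# Correlation matrices for the three methods
cor_pearson  <- cor(num, method = "pearson")
cor_kendall  <- cor(num, method = "kendall")
cor_spearman <- cor(num, method = "spearman")

# Ordered: list variable pairs by strength of (Pearson) correlation
idx <- which(upper.tri(cor_pearson), arr.ind = TRUE)
pairs_tbl <- data.frame(var1 = rownames(cor_pearson)[idx[, 1]],
                        var2 = colnames(cor_pearson)[idx[, 2]],
                        r    = cor_pearson[idx])
pairs_tbl <- pairs_tbl[order(-abs(pairs_tbl$r)), ]

# Hierarchical: cluster the variables using 1 - |r| as the distance
var_tree <- hclust(as.dist(1 - abs(cor_pearson)))

# Principal components on standardized variables
pca_svd   <- prcomp(num, scale. = TRUE)    # Singular value decomposition
pca_eigen <- princomp(num, cor = TRUE)     # Eigen decomposition of the correlation matrix

print(pairs_tbl)
print(summary(pca_svd))
```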
Model
Tree
  Traditional: trade-off between performance and simplicity of explanation
  Conditional
Forest (many decision trees built using random subsets of the data and variables)
  Number of Trees
  Number of Variables
  Impute (set the median numeric value for missing values)
  Sample Size (for balancing classes)
  Importance (variable importance)
  Rules (collection of random forest rules)
  ROC (ROC curve)
  Errors
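A traditional decision tree can be fitted at the console with the rpart package, which ships with standard R installations. The sketch uses the built-in iris data and an arbitrary 70/30 split; the seed and the split size are choices made for this example. A forest would be fitted analogously with the randomForest package, which is not used here because it requires a separate install.

```r
library(rpart)

set.seed(42)                                    # Arbitrary seed, for reproducibility
train_idx <- sample(nrow(iris), 105)            # 70% of the 150 observations

# Traditional tree: Species is the target, all other variables are inputs
tree <- rpart(Species ~ ., data = iris[train_idx, ], method = "class")

# Evaluate on the held-out 30%
pred     <- predict(tree, iris[-train_idx, ], type = "class")
accuracy <- mean(pred == iris$Species[-train_idx])

print(tree)       # The splitting rules, which is what makes a tree easy to explain
print(accuracy)
```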
Model
SVM
  Starts with two parallel vectors
Linear (linear regression)
  For continuous values
All
Cluster
K-Means
  The number of clusters (K) must be set first
Ewkm
  K-Means with entropy weighting
Hierarchical
  No need to set the number of clusters first
BiCluster
  Finds suitable subsets of both the variables and the observations
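The difference between K-Means (K chosen up front) and hierarchical clustering (the tree is built first and cut afterwards) can be seen in a small sketch using R's stats package. The two synthetic groups of points are generated only for this example.

```r
set.seed(1)  # Arbitrary seed, for reproducibility

# Two well-separated synthetic groups of 20 points each in two dimensions
pts <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))

# K-Means: the number of clusters is required before fitting
km <- kmeans(pts, centers = 2, nstart = 10)

# Hierarchical: build the full tree, then decide where to cut it
hc        <- hclust(dist(pts), method = "ward.D2")
hc_groups <- cutree(hc, k = 2)

print(table(kmeans = km$cluster, hierarchical = hc_groups))
```

On data this cleanly separated, both methods should recover the same two groups, although the cluster labels themselves may be swapped.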