Post on 28-May-2022
SASforActuarialScienceStudents
ZhaochenGaoMentor:DavidVarodayan
UniversityofIllinoisatUrbana-Champaign
Chapter1BasicsofSASinActuarialScienceThischapterwillfocusonthreebasicimplementationsofSASindataprocessing,includingtheDataReading,DataSubsetandDataValidatingandCleaning.Chapter1.1DataReadingInordertoanalyzethedatathroughSAS,atfirst,wehavetolearnabouthowtoreadtherawdatafilesthroughSAS.Youareexpectedtoimportdatafrommanydifferentfileformats.Usually,therearethreefileformats:1)SASDataSets,2)ExcelWorksheet3)DelimitedRawDataFiles.
ReadingSASDataSets libname________________________________;data_____________________________________;set__________________________________;…run;
ReadingExcelWorksheets libname________________________________;data_____________________________________;set__________________________________;…run;
ReadingDelimitedRawDataFiles data____________________________________;infile______________________________;input______________________________;…run;
ForSASDataSetsandExcelWorksheets:TheLIBNAMEstatementassignsalibraryreferencenametoaSASdatalibrary,ifapermanentdatasetisbeingreadorcreated.
LIBNAMElibref'physical-file-name'<options>;
TheDATAstatementcancreatetemporaryorpermanentdatasets,whichisfollowedbythenameofoutputdataset.TheSETstatementautomaticallyreadalltheobservationsfromtheinputdataset.ForDelimitedRawDataFiles:
1) StandardDelimitedRawData:
TheINFILEstatementaccessesthephysicalnameoftherawdatafile.
TheDLM=optioncanbeaddedtotheINFILEstatementtoidentifythespecificdelimiter.Forexample,dlm=’’(blankspace),dlm=’,’anddlm=’09’x(tab).TheINPUTstatementspecifieshowtoreadthedatafromrawdatafiles.Inputspecificationsincludecolumninput,formattedinputandlistinput.Wecanadd$signtoindicatethatwewanttostorethevariableasacharactervalueratherthananumericvalue.
TheLENGTHstatementdefinesthelengthofavariable.
LENGTHvariable(s)$length;
2) NonstandardDelimitedData
TheDSDoptionfortheINFILEstatementsetsthedefaultdelimitertoacomma,treatsconsecutivedelimitersasmissingvaluesandenablesSAStoreadvalueswithembeddeddelimitersifthevalueissurroundedbyquotationmarks.TheMISSOVERoptionrecognizesmissingvaluesattheendofarecord
INFILE'raw-data-file-name'DSDMISSOVER;Example:ForourdatasetforTitanic,it’sarawdataset.Wecanreaditthroughthisway.
Andwewillgettheoutput.
Atfirst,wedefinethelengthofeachvariable.AndwereadPassengerID,NameandSexascharactervalue.Sincethefirstlineofourrawdatasetisthenumberofeachvariables,wereadthesecondlineasourfirstobservation.Chapter1.2SubsetDataLearninghowtosubsetdataisquiteusefulwhenwereceivemassiveamountsofdatabutonlywanttoanalyzepartofit.TheWHEREstatementsubsetsobservationsthatmeetaparticularcondition.Theconditionscouldcontainsvariousoperators,includingthecomparison,arithmeticcalculation,orlogicaloperation.Charactervaluesmustbeenclosedinquotationmarksandarecasesensitive.Numericvaluesdonotusequotationmarks.ThesubsettingIFstatementcontinuesprocessingonlythoseobservationsthatmeetthecondition.AnalternativetothesubsettingIFstatementistheDELETEstatementinanIF-THENstatement.GeneralformoftheIF-THENDELETEstatement:TheDELETEstatementstopsprocessingthecurrentobservation.Example:Ifwewanttogetthedataaboutfemalesurvivorsinclass1,thenwecanwriteas:
*ThePRINTproceduredisplaysthedataportionofaSASdataset.Bydefault,PROCPRINTdisplaysthefollowing:allobservations,allvariablesandanObscolumnontheleftside.*TheCONTENTSprocedurewiththe_ALL_keywordproducesalistofalltheSASfilesinthedatalibrary.Afterprocessingthestepslistedabove,wecangetthesubsetlikethis:
Chapter1.3ValidatingandCleaningDataAtmosttimes,wecanobservedataerrorsinarawdatafile.Inordertoprocessthem,wehavetolearntheproceduresortechniquesforvalidatingdataandcleaningdata.WhenSASencountersadataerror,theseeventsoccur:-AnotethatdescribestheerrorisprintedintheSASlog.-Theinputrecord(contentsoftheinputbuffer)beingreadisdisplayedintheSASlog.-ThevaluesintheSASobservation(contentsofthePDV)beingcreatedaredisplayedintheSASlog.-AmissingvalueisassignedtotheappropriateSASvariable.Therearefourapproacheswecanusetovalidatethedata.PROCPRINTstepwithVARandWHEREstatements
Detectsinvalidcharacterandnumericvaluesbysubsettingobservationsbasedonconditions.
PROCFREQstepwithTABLESstatement
Detectsinvalidcharacterandnumericvaluesbylookingatdistinctvalues.
PROCMEANSstepwithVARstatement
Detectsinvalidnumericvaluesbyusingsummarystatistics.
PROCUNIVARIATEstepwithVARstatement
Detectsinvalidnumericvaluesbylookingatextremevalues.
1.PRINTProcedureTheVARstatementhelpsselectvariablestobeincludedinthereportandalsoshowstheorderofdisplayofthesevariables.Forexample,IfwecanexaminethemissingvaluesorinvaliddataforthePclassvariable(theobservationthatisnot1,2or3),wecandoitinthisway:
Andfortunately,wegetnothingappearedintheresults.SothereisnomissingvaluesorinvaliddataforTitanicdataset.2.FreqProcedureTheFreqstatementproducesthethefrequencyofdistinctivevaluesfortheacquiredvariables.
PROCFREQDATA=data_set;TABLESvariables;
RUN;Forourexample,wecanexamallthevariablesthroughoneFreqstatement.
*nlevelshelpsusgettheleveltableforourvariables.Thecolumnnamed‘Levels’showsthenumberofdistinctivevaluesforthecorrespondingvariable.The‘MissingLevels’showsthatwhetherornottherearemissingvalues.*noprintintheTablesstatementsuppressthedisplayofseparatetableforeachofvariable.Here’sanexamplefortheseparatetableofvariableAge.
Sothereare177missingvalues.Anditseemslikethatthefirstfiveobservationsarenoteveninteger.DataValidatingProcedure:
Weaddanewvariablenamed“AgeGroup”foranalyzingdatainthenextchapter.Chapter2AnalyzingdataThischapterwilldisplayhowtoanalyzethedatabychoosingdifferentvariablesanddrawingdifferentgraphs.Chapter2.1DataAnalysisviaVisualizationExampleIfwewanttofindtherelationshipbetweenclassofcabinsandthesurvivalrate,wecandrawthehistogramfordifferentclassandcomparetheirsurvivalrate.
Thenwegettheresults:
WecaneasilyobservethatthesurvivalrateforClass1ismuchhigherthanthatforClass2andClass3.Ifpeopleupgradetheirclass,theywillhavehighersurvivalrate.Ifwewanttocombinethehistogramsofthreeclass,thenwecanrunfollowingsteps:
Andgetthefinalresults:
Accordingtothecombinedhistogram,wecanstillobservethatthesurvivalrateforclass1ishighest.Andthedeathrateforclass3ishighest.Therefore,wecanconcludethatthesurvivalratedecreasesalongwiththedowngradeofclassstanding.Similarly,forothervariables,wecanalsodrawthehistogramsforotherimportantvaribles:
(1) Sex
Wecaneasilyobservethatthesurvivalrateforfemaleissignificantlyhigherthanmale.Thismaybeowingtothefactthatmostofmalegavetheirseatsonlifeboatstofemale.
(2)Embarked
WecanobservethatthesurvivalrateforEmbarkedlabeled“C”ishighestanddeathrateforEmbarkedlabeled“S”ishighest.
(2) AgeGroup
WecanobservethatthesurvivalrateforKidishighestandthedeathrateishighestforSenior.Maybeit’sowingtothatkidsarethepriority.However,itwassohardforseniorstogetthroughthisdisaster.
Chapter2.2FreqTableInordertoanalyzetherelationshipbetweenSurvivalRateandothervariables,let’sstartwithgettingsomedescriptivestatisticsofthevariablesofinterestusingfrequencytable.Example:
Results:
Accordingtofrequencytable,74.20%offemalesurvived.Butonly18.89%survived.And68.13%ofthesurvivedpeoplearefemale.
Accordingtothefrequencytable,62.96%ofClass1peoplesurvived.Only24.24%ofClass3peoplesurvived.
Accordingtothefrequencytable,55.36%ofEmbarked“C”survived.However,for“Q”and“S”,survivalratesareonly38.96%and33.70%.
Accordingtothefrequencytable,54.69%ofkidssurvived.However,foradultsandseniors,survivalratesareonly39.69%and36.67%.Chapter2.3LogisticRegressionIfwewanttoperformstatisticaltestonourdataset,thenvariableSurvivedwouldbethedependentvariable.However,thereareonlytwovaluesforSurvived,0and1.Therefore,wewoulduselogisticregressionmodel.Logisticregressionassumesthattheoutcomevariableisbinary(i.e.,codedas0and1).Example:
AboveweusePROCLOGISTICtoestimatealogisticregressionmodel.TheoutcomeSurvivedandthepredictorPclassarebothcategoricalvariablesandshouldbeindicatedassuchontheCLASSstatement.WecanspecifythebaselinecategoryforSurvivedusing(ref="1")andthereferencegroupforPclassusing(ref="1").ThePARAM=REFoptionontheCLASSstatementtellsSAStousedummycodingratherthaneffectcodingforthevariablePclass.PartofResults:
Intheoutputabove,thelikelihoodratiochi-squareof103.5471withap-value<0.0001tellsusthatourmodelasawholefitssignificantlybetterthananemptymodel.Therelativelogoddsofbeingdiedvs.survivedwillincreaseby1.6704ifmovingfromthehighestlevelofPclass(Pclass==1)tothelowestlevelofPclass(Pclass==3).TheoveralleffectofPclasslislistedunder"Type3AnalysisofEffects",anditissignificant.Othervariables:(1)Sex
Thelikelihoodratiochi-squareof268.8512withap-value<0.0001tellsusthatourmodelasawholefitssignificantlybetterthananemptymodel.Therelativelogoddsofbeingdiedvs.survivedwillincreaseby2.5137ifmovingfromthesexFemaletosexMale.TheoveralleffectofSexislistedunder"Type3AnalysisofEffects",anditissignificant.(2)Embarked
Thelikelihoodratiochi-squareof25.8653withap-value<0.0001tellsusthatourmodelasawholefitssignificantlybetterthananemptymodel.Therelativelogoddsofbeingdiedvs.survivedwilldecreaseby0.8919ifmovingfromPclass'S'totheEmbarkedlabeled'C'.TheoveralleffectofPclasslislistedunder"Type3AnalysisofEffects",anditissignificant.(3)AgeGroup
Thelikelihoodratiochi-squareof5.7879withap-value=0.1224tellsusthatourmodelasawholefitsnotsosignificantlybetterthananemptymodel.Therelativelogoddsofbeingdiedvs.survivedwillincreaseby0.6699ifmovingfromAgeGroup'Kid'toAgeGroup'Teenager'.Chapter2.4MultinomialLogisticRegressionIt’ssimilarwithsimplelogisticregressionmodel.Butwecanusemorethanonevariableasdependentvariablesrightnow.Example:
Thelikelihoodratiochi-squareof359.7668withap-value<0.0001tellsusthatourmodelasawholefitssignificantlybetterthananemptymodel.Therelativelogoddsofbeingdiedvs.survivedwillincreaseby1.9055ifmovingfromthehighestlevelofPclass(Pclass==1)tothelowestlevelofPclass(Pclass==3).Therelativelogoddsofbeingdiedvs.survivedwillincreaseby2.6419ifmovingfromthesexFemaletosexMale.TheoveralleffectsofPclasslandSexarelistedunder"Type3AnalysisofEffects",andbotharesignificant.Chapter2.5PredictedProbabilitiesWecanusepredictedprobabilitiestohelpusunderstandthemodel.WecancalculatepredictedprobabilitiesusingtheLSMEANSstatementandtheILINKoption.Formultinomialdata,LSMEANSrequiresglmratherthanreference(dummy)coding,eventhoughtheyareessentiallythesame,sobesuretore-specifythecodingontheclassstatement.However,glmcodingonlyallowsthelastcategorytobethereferencegroup(Survived=0andPclass=3)andwillignoreanyotherreferencegroupspecifications.BelowweuseLSMEANStocalculatethepredictedprobabilityofsurvivalateachlevelofPclass.Example:
e-RequeststhattheLmatrixcoefficientsfortheLSMEANSeffectsbedisplayed.
ilink-Requeststhatestimatesandtheirstandarderrorsinthe"LeastSquaresMeans"tablealsobereportedonthescaleofthemean(theinverselinkedscale).cl-Requeststhatt-typeconfidencelimitsbeconstructedforeachoftheLS-means.Theconfidencelevelis0.95bydefault;thiscanbechangedwiththeALPHA=option.PartofResults:
Thepredictedprobabilitiesareinthe"Mean"column.Thus,forPclass=3,weseethattheprobabilityofbeingdiedis0.7576.ToobtainpredictedprobabilitiesfortheSurvived(=1),wecanreversetheorderingofthecategoriesusingthedescendingoptionontheproclogisticstatement.Thiswillmake1thevalueforSurvivedand3thereferencegroupforPclass.
PartofResults:
HereweseetheprobabilityofbeingsurvivedwhenPclass=3is0.2424,whichiswhatwewouldhaveexpectedsince(1-0.7576)=0.2424,where0.7576istheprobabilitiesofbeingdiedunderthesameconditions.Chapter2.6Conclusion
ThisguidecoversbasicusageofSASondataanalysis.Itshowshowtoreaddata,validatedata,cleandata,drawhistogram,analyzedataandperformlogisticregression.