STATA Training Session 2


  • Sun Li, Centre for Academic Computing

    [email protected]

    STATA Training Session 2

    Statistical Analysis in STATA

  • Outline

    Resources And Books

    Data Description And Simple Inference

    Group Comparison And Correlation

    General Linear Regression

    Logistic Model

    Binary Logistic Model

    Ordinal Logistic Model

    Multinomial Logistic Model

  • Resources And Books

    CAC Computing Resources for STATA users

    Windows:

    STATA/SE version 10.0

    10-user network perpetual license

    Installation guide

    (http://research2.smu.edu.sg/CAC/StatisticalComputing/Wiki/STATA-Software Questions.aspx)

    Linux CAC Beowulf Cluster:

    STATA/SE version 10.0

    Unlimited users

    About CAC Beowulf Cluster:

    (http://research2.smu.edu.sg/CAC/HPC/Wiki/MAIN.aspx)

    New features in STATA 10.0 (http://www.stata.com/stata10)

  • Resources And Books

    Website resources:

    The STATA website: http://www.stata.com

    The STATA Journal (reviewed papers, regular columns, user-written software): http://www.stata-journal.com/

    STATA FAQ: http://www.stata.com/support/faqs

    STATA User Support: http://www.stata.com/support

    Books: http://www.stata.com/bookstore/

    CAC STATA support:

    Website:

    http://research2.smu.edu.sg/CAC/StatisticalComputing/Wiki/STATA.aspx

    Contact:

    For statistical consultation: Sun Li: [email protected]

    For software installation: TAN SuhWen: [email protected]

  • Resources And Books

    Additional recommended readings:

    Regression Models for Categorical Dependent Variables Using Stata, 2nd Edition, J. Scott Long and Jeremy Freese

    Logistic Regression with Stata, Xiao Chen, Phil Ender, Michael Mitchell & Christine Wells, UCLA

    Statistics with Stata (Updated for Version 9), Lawrence C. Hamilton

    Data Analysis Using Stata, Ulrich Kohler and Frauke Kreuter

  • Download training slides, data and syntax:

    http://research2.smu.edu.sg/CAC/StatisticalComputing/Wiki/Training%20Slides%20and%20Syntax.aspx

  • Data Description & Simple Inference

    Description of Data

    Name: ibmff.dta

    Variables:

    Variable name    Variable information
    permno           CRSP Permanent Number
    date             Numeric date
    ret              Holding Period Return
    retx             Return without dividends
    mktrf            Excess return on market
    smb              Small-minus-big return
    hml              High-minus-low return
    rf               Risk-free return rate
    umd              Momentum factor

  • Data Description & Simple Inference

    Convert to STATA date format

    gen year=int(date/10000)

    gen month=int((date-year*10000)/100)

    gen day=date-year*10000-month*100

    gen newdate=mdy(month, day, year)

    format newdate %td

    list date newdate year month day in 1
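    The arithmetic in the gen commands above can be checked outside Stata. The following is a minimal Python sketch, assuming date is stored as a number of the form YYYYMMDD; the function names and the example value are illustrative and not part of the training data. mdy() returns the number of days since 01jan1960, which the second function mimics.

```python
from datetime import date

def split_yyyymmdd(numeric_date):
    """Mirror the three gen commands: peel off year, month, and day."""
    year = numeric_date // 10000
    month = (numeric_date - year * 10000) // 100
    day = numeric_date - year * 10000 - month * 100
    return year, month, day

def to_stata_daily(numeric_date):
    """Days elapsed since 01jan1960, the value Stata's mdy() stores."""
    year, month, day = split_yyyymmdd(numeric_date)
    return (date(year, month, day) - date(1960, 1, 1)).days

example = 19980130  # hypothetical value of the date variable
print(split_yyyymmdd(example))  # (1998, 1, 30)
print(to_stata_daily(example))
```

    The %td format only changes how the stored day count is displayed; the underlying value stays numeric.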

  • Data Description & Simple Inference

    Distribution of Variables

    pnorm ret

    swilk ret mktrf smb hml rf

    [Figure: standardized normal probability plot of ret; x-axis: Empirical P[i] = i/(N+1), y-axis: Normal F[(ret-m)/s]]

    pnorm: Standardized normal probability plot

    swilk: Shapiro-Wilk normality test, with the null hypothesis that the data are normally distributed.

  • Data Description & Simple Inference

    None of the variables listed appears to be normally distributed. As ret is the variable of interest, we adjust its skewness with the zero-skewness log transform lnskew0, then run swilk to test it again.

    lnskew0 lnret=ret

    swilk lnret ret

    For lnret, the p-value is greater than 0.05 (the conventional 5% significance level, i.e. 95% confidence), so we do not reject the hypothesis that the data are normally distributed.
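    To illustrate what a zero-skewness log transform does, here is a rough Python sketch. It assumes right-skewed input and searches by bisection for a shift k such that ln(x - k) has sample skewness close to zero. This is only an illustration of the idea, not Stata's own algorithm, and the function names and data are made up.

```python
import math

def skewness(values):
    """Sample skewness: third central moment over variance to the power 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def zero_skew_shift(values, iterations=100):
    """Find k < min(values) so that ln(v - k) has roughly zero skewness."""
    low = min(values)
    spread = max(values) - low

    def skew_at(distance):
        # distance = low - k, so v - k = v - low + distance
        return skewness([math.log(v - low + distance) for v in values])

    d_small, d_large = 1e-6 * spread, 1e3 * spread
    if not (skew_at(d_small) < 0 < skew_at(d_large)):
        raise ValueError("no sign change; input may not be right-skewed")
    for _ in range(iterations):
        d_mid = math.sqrt(d_small * d_large)  # bisect on the log scale
        if skew_at(d_mid) < 0:
            d_small = d_mid
        else:
            d_large = d_mid
    return low - math.sqrt(d_small * d_large)

data = [0.10, 0.20, 0.25, 0.30, 0.90]  # made-up, right-skewed values
k = zero_skew_shift(data)
transformed = [math.log(v - k) for v in data]
print(skewness(data) > 0, abs(skewness(transformed)) < 1e-6)
```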

  • Group Comparison & Correlation

    Question: Is the average holding period return for year 1998 significantly different from the average return of all the other years?

    Generate dummy variables

    tab year, gen(dumyear)

  • Group Comparison & Correlation

    tabstat lnret, stat(n mean sd p25 p50 p75) by(dumyear1)

    graph box lnret, by(dumyear1) box(1, bfcolor(blue))

    [Figure: box plots of lnret, i.e. ln(ret+.7527907), by dumyear1 (year == 1998: 0 versus 1)]

  • Group Comparison & Correlation

    sdtest lnret, by(dumyear1)

    ttest lnret, by(dumyear1)

    sdtest: tests the equality of variances.

    ttest: performs one-sample and independent-samples t tests.
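    As a rough illustration of what the two commands compute for two groups, the Python sketch below forms the variance ratio and the pooled-variance t statistic. It assumes equal variances for the t statistic, which is what ttest with by() does unless the unequal option is given. The sample values are made up, and p-values are left out because they need the F and t distributions.

```python
import math
from statistics import mean, variance

def variance_ratio(group_a, group_b):
    """Ratio of sample variances, the F statistic for equality of variances."""
    return variance(group_a) / variance(group_b)

def pooled_t(group_a, group_b):
    """Two-sample t statistic with pooled variance, plus degrees of freedom."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / std_err, df

other_years = [1.0, 2.0, 3.0]  # made-up values for dumyear1 == 0
year_1998 = [2.0, 3.0, 4.0]    # made-up values for dumyear1 == 1
t_stat, df = pooled_t(other_years, year_1998)
print(variance_ratio(other_years, year_1998), round(t_stat, 4), df)
```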

  • Group Comparison & Correlation

    Question: Are the average holding period returns of the different years significantly different from each other? If yes, find out which groups differ.

    oneway lnret year

    oneway lnret year, tabulate bonferroni

    tabulate: lists the average returns for all the years.

    bonferroni: performs multiple comparisons between groups with adjusted p-values.
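    The Bonferroni adjustment itself is simple arithmetic: each raw pairwise p-value is multiplied by the number of comparisons and capped at 1. A minimal Python sketch with made-up p-values for three years (the numbers are illustrative, not output from ibmff.dta):

```python
from itertools import combinations

def bonferroni_adjust(raw_p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    n_comparisons = len(raw_p_values)
    return {pair: min(1.0, p * n_comparisons)
            for pair, p in raw_p_values.items()}

years = [1998, 1999, 2000]
pairs = list(combinations(years, 2))  # k groups give k*(k-1)/2 pairs
made_up_p = dict(zip(pairs, [0.01, 0.20, 0.60]))
adjusted = bonferroni_adjust(made_up_p)
for pair in pairs:
    print(pair, round(adjusted[pair], 4))
```

    With many years the number of pairs grows quickly, so the adjusted p-values can become very conservative.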

  • Group Comparison & Correlation

    To detect correlations between returns and the other factors:

    graph matrix ret mktrf smb hml rf, half

    spearman ret mktrf smb hml rf, stats(rho p) print(.05) bonferroni

    [Figure: lower-half scatterplot matrix of ret, excess return on the market, small-minus-big return, high-minus-low return, and risk-free return rate (one-month treasury bill rate)]
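    Spearman's rho, as requested with the spearman command, is the Pearson correlation computed on ranks instead of raw values. The Python sketch below shows this under the usual convention that tied values share their average rank; the function names and data are illustrative.

```python
import math

def average_ranks(values):
    """Rank from 1..n, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

print(average_ranks([10, 20, 20, 30]))  # [1.0, 2.5, 2.5, 4.0]
print(round(spearman_rho([1, 2, 3, 4], [1, 4, 9, 16]), 6))
```

    Because only ranks enter the calculation, any monotone relationship gives a rho of 1 or -1 even when it is not linear.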

  • Group Comparison & Correlation

    Exercise 1

    1. Tabulate the average risk-free return rate by year.

    2. Use help to look up the command ranksum (Mann-Whitney U test).

    3. Test whether the average risk-free return rate in 2005 is significantly different from that in 2006 using the Mann-Whitney U test. (Hint: generate a dummy variable first.)

    4. Use help to look up the command correlate (Pearson's correlation).

    5. Identify correlations between the factors of interest in 2006 using listwise and pairwise Pearson's correlations respectively.

  • General Linear Regression

    General Form of Model

    Data: ibmff.dta

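    $$Y = X\beta + \varepsilon$$

    $Y$ is the $n \times 1$ vector of responses.

    $X$ is the $n \times (p+1)$ matrix of explanatory variables.

    $$\hat{\beta} = (X'X)^{-1}X'Y$$

    $\hat{\beta}$ is the vector of least squares estimates of the regression coefficients.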
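    The least squares estimate can be reproduced by hand on a tiny example. The Python sketch below builds X'X and X'Y and solves the resulting system by Gaussian elimination, assuming a constant column is added to X. The data are made up so that the exact coefficients are known; regress uses more careful numerical methods than this.

```python
def solve_linear_system(matrix, rhs):
    """Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def ols(predictor_rows, responses):
    """Solve (X'X) b = X'Y, with a leading column of ones for the intercept."""
    design = [[1.0] + list(row) for row in predictor_rows]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, responses))
           for i in range(k)]
    return solve_linear_system(xtx, xty)

# Made-up data generated from y = 1 + 2*x1 - 3*x2 with no noise.
rows = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
y = [1, 3, -2, 0, 2]
print([round(b, 6) for b in ols(rows, y)])
```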

  • General Linear Regression

    Step 1: Examine data

    graph matrix ret mktrf smb hml rf umd

    [Figure: scatterplot matrix of ret, excess return on the market, small-minus-big return, high-minus-low return, risk-free return rate (one-month treasury bill rate), and momentum factor]

  • General Linear Regression

    Step 2: Perform Linear Regression

    regress ret mktrf smb hml rf umd year

    regress: to perform linear regression

  • General Linear Regression

    sw regress ret mktrf smb hml rf umd year, pe(0.05)

    sw: to perform stepwise regression

    pe(0.05): to specify the significance level of the F-test for addition to the model; terms with a p-value less than 0.05 will be included.

  • General Linear Regression

    Step 3: Post-estimation Statistics

    vif //variance inflation factor

    rvfplot //plot residuals against predicted values

    predict fit //store fitted values

    predict sdres, rstandard //store standardized residuals

    pnorm sdres //normal probability plot of standardized residuals

    twoway scatter sdres fit //plot standardized residuals against fitted values

    predict cook, cooksd //store Cook's distance statistics

    list year ret cook if cook>4/108 //list observations whose Cook's distance is above the suggested cut-off point (4/n)
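    Two of the diagnostics above reduce to short formulas. The Python sketch below shows the 4/n rule of thumb for Cook's distance and the variance inflation factor 1/(1 - R²), where R² comes from regressing one predictor on the others. The distances and the R² value are made up for illustration.

```python
def variance_inflation_factor(r_squared):
    """VIF of a predictor, given R-squared from regressing it on the others."""
    return 1.0 / (1.0 - r_squared)

def influential_rows(cooks_distances, n_obs):
    """Indices of observations whose Cook's distance exceeds 4/n."""
    cutoff = 4.0 / n_obs
    return [i for i, d in enumerate(cooks_distances) if d > cutoff]

made_up_cook = [0.001, 0.050, 0.020, 0.120]
print(influential_rows(made_up_cook, n_obs=108))  # [1, 3]
print(variance_inflation_factor(0.5))  # 2.0
```

    With n = 108 the cut-off is about 0.037, which is why the list command compares cook with 4/108.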

  • General Linear Regression

    [Figure: residuals against fitted values (rvfplot)]

    [Figure: standardized normal probability plot of sdres; x-axis: Empirical P[i] = i/(N+1), y-axis: Normal F[(sdres-m)/s]]

  • General Linear Regression

    Exercise 2

    1. Repeat the analysis described in this section after removing the possible outliers identified by Cook's distance.

    2. After finishing Q1, repeat the analysis but treat the variable year as categorical.

    hint: use command

    xi: sw regress ret mktrf smb hml rf umd i.year, pe(0.05)

  • Logistic Model

    Binary logistic model: dichotomous response outcomes

    e.g., presence or absence of an event

    Ordinal logistic model: ordinal response variable with more than two ordered categories

    e.g., a 5-point Likert scale

    Multinomial logistic model: nominal response variable with more than two categories

    e.g., different types of programs in school

  • Binary Logistic Regression

    General Form of Model

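    $$\pi_i = E(y_i \mid x_i)$$

    $$\operatorname{logit}(\pi_i) = \log\left(\pi_i/(1-\pi_i)\right) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip}$$

    $$\pi_i = \exp(x_i'\beta)/(1+\exp(x_i'\beta))$$

    $\exp(\beta_k)$ is the odds ratio for $y = 1$ when $x_k$ increases by one unit and all other covariates remain the same.

    Binary responses are typically coded as 1 for the event of interest, and 0 for the opposite event.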
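    The link between coefficients, probabilities, and odds ratios can be checked numerically. The Python sketch below uses made-up coefficients (not estimates from any data in this session) to show that raising one covariate by one unit multiplies the odds by the exponential of its coefficient.

```python
import math

def inverse_logit(linear_predictor):
    """Map a linear predictor to a probability."""
    return math.exp(linear_predictor) / (1.0 + math.exp(linear_predictor))

def probability(intercept, coefficients, covariates):
    """Predicted probability of y = 1 for one observation."""
    linear = intercept + sum(b * x for b, x in zip(coefficients, covariates))
    return inverse_logit(linear)

def odds(p):
    """Odds corresponding to probability p."""
    return p / (1.0 - p)

beta0, betas = -1.0, [0.08, -0.25]  # made-up coefficients
base = probability(beta0, betas, [10.0, 2.0])
bumped = probability(beta0, betas, [11.0, 2.0])  # first covariate + 1
odds_ratio = odds(bumped) / odds(base)
print(round(odds_ratio, 6), round(math.exp(betas[0]), 6))
```

    The two printed numbers agree, whatever values the other covariates are held at.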

  • Binary Logistic Regression

    Description of Data

    Goal: identify persons with a high chance of defaulting on a bank loan. We have 700 records from the bank database (bankloan.csv).

    Variable name    Variable information
    age              Age in years
    ed               Level of education (1 = did not complete high school, 2 = high school degree, 3 = college degree, 4 = undergraduate, 5 = postgraduate)
    employ           Years with current employer
    address          Years at current address
    income           Household income in thousands
    debtinc          Debt to income ratio (*100)
    creddebt         Credit card debt in thousands
    othdebt          Other debts in thousands
    default          Previously defaulted (1 = Yes; 0 = No)

  • Binary Logistic Regression

    Step 1: Import and examine data

    insheet using bankloan.csv

    d

    browse

    codebook default

  • Binary Logistic Regression

    tabstat age employ address income debtinc creddebt othdebt, by(default)

    table ed, c(mean income mean age mean debtinc mean creddebt mean othdebt) by(default)

  • Binary Logistic Regression

    Step 2: Construct logistic model

    logistic default age ed employ income address

    estimates store model1

    logistic default age ed employ income address debtinc creddebt othdebt

    lrtest model1 .

    sw logit default age address employ income debtinc creddebt othdebt, pe(0.05)

    logistic: reports odds ratios.

    logit: reports coefficients.

    estimates store: stores the current estimation results, including the log likelihood, under a name.

    lrtest: reports the p-value of the likelihood-ratio test.

  • Binary Logistic Regression

    Step 3: Post-estimation statistics

    predict prob

    predict resi, rstandard

    hist resi

    estat gof

    estat gof: goodness-of-fit test

    [Figure: histogram of resi; x-axis: standardized Pearson residual, y-axis: Density]

  • Binary Logistic Regression

    estat classification

    The output gives a summary of correct predictions, a summary of incorrect predictions, and the overall success rate.

    These are calculated using 50% as the cut-off point for positive predictions.
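    The classification summary can be sketched in a few lines of Python. The example assumes an observation is classified as positive when its predicted probability is at least the cut-off (0.5 here). The outcomes and probabilities are made up.

```python
def classification_summary(outcomes, probabilities, cutoff=0.5):
    """Count correct and incorrect predictions at a probability cut-off."""
    true_pos = false_neg = false_pos = true_neg = 0
    for y, p in zip(outcomes, probabilities):
        predicted = 1 if p >= cutoff else 0
        if y == 1 and predicted == 1:
            true_pos += 1
        elif y == 1:
            false_neg += 1
        elif predicted == 1:
            false_pos += 1
        else:
            true_neg += 1
    return {
        "sensitivity": true_pos / (true_pos + false_neg),
        "specificity": true_neg / (true_neg + false_pos),
        "correctly_classified": (true_pos + true_neg) / len(outcomes),
    }

made_up_default = [1, 1, 0, 0, 1]
made_up_prob = [0.9, 0.4, 0.2, 0.6, 0.5]
summary = classification_summary(made_up_default, made_up_prob)
print(summary)
```

    Moving the cut-off trades sensitivity against specificity, which is what lsens plots.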

  • Binary Logistic Regression

    gen z=_b[debtinc]*debtinc+_b[employ]*employ+_b[creddebt]*creddebt+_b[address]*address

    line prob z, sort

    [Figure: line plot of Pr(default) against z]

  • Binary Logistic Regression

    gen empcat=employ>5

    logit default address empcat debtinc creddebt

    postgr3 debtinc, by(empcat) //you need to install postgr3 package

    [Figure: predicted probabilities (yhat_) against debtinc, plotted separately for empcat == 0 and empcat == 1]

    postgr3: graphs the predicted values, holding all other variables constant at specified values (default is the mean).

    The marginal impact of debtinc is higher for people with short service than for those with long service in their current company.

  • Binary Logistic Regression

    Exercise 3

    1. Explore the use of the commands lroc and lsens to diagnose the model, and interpret the results.

    lroc: graphs the ROC curve and calculates the area under the curve.

    lsens: graphs sensitivity and specificity versus probability cutoff.

    2. Predict the probability of default on a bank loan for a person with a debt/income ratio of 22.7, 2 years with the current employer, 16 years living at the current address, and credit card debt of 1.21 thousand.

  • Ordinal Logistic Model

    General Form of Model

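    $$\operatorname{Logit}(p_1) = \log\frac{p_1}{1-p_1} = \beta_{10} + \beta'x$$

    $$\operatorname{Logit}(p_1+p_2) = \log\frac{p_1+p_2}{1-p_1-p_2} = \beta_{20} + \beta'x$$

    $$\vdots$$

    $$\operatorname{Logit}(p_1+p_2+\dots+p_k) = \log\frac{p_1+p_2+\dots+p_k}{1-p_1-p_2-\dots-p_k} = \beta_{k0} + \beta'x$$

    and $p_1+p_2+\dots+p_k+p_{k+1} = 1$.

    $\exp(\beta_k)$ represents the odds ratio for $y \le s$, for any $s$, when $x_k$ increases by one unit and all other covariates remain the same.

    Ordered responses with k categories can be formulated as a threshold model.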
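    To see how cumulative logits turn into category probabilities, here is a small Python sketch. It assumes the cumulative form logit(P(y <= s)) = intercept_s + b'x with increasing intercepts. Note that Stata's ologit reports cut points with the opposite sign on the linear predictor, P(y <= s) = invlogit(cut_s - xb), so signs must be flipped when plugging in ologit output. All numbers are made up.

```python
import math

def inverse_logit(value):
    """Map a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-value))

def category_probabilities(intercepts, coefficients, covariates):
    """Differences of successive cumulative probabilities.

    intercepts must be increasing; k intercepts give k + 1 categories.
    """
    xb = sum(b * x for b, x in zip(coefficients, covariates))
    cumulative = [inverse_logit(a + xb) for a in intercepts]
    bounds = [0.0] + cumulative + [1.0]
    return [bounds[j + 1] - bounds[j] for j in range(len(bounds) - 1)]

made_up_intercepts = [-1.0, 0.5, 2.0]  # three thresholds, four categories
made_up_betas = [0.03, -0.40]
probs = category_probabilities(made_up_intercepts, made_up_betas, [30.0, 2.0])
print([round(p, 4) for p in probs], round(sum(probs), 6))
```

    The same coefficient vector shifts every cumulative logit, which is the proportional odds (parallel regression) assumption.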

  • Ordinal Logistic Model

    Construct model

    recode income (min/20=1 "

  • Ordinal Logistic Model

    xi: ologit inccat age i.ed employ debtinc, or

    listcoef, help

    ologit: to perform ordered logistic regression.

    listcoef: to obtain ORs and the change in odds for a one-standard-deviation change in the variable (listcoef is part of the user-written SPost package).

  • Ordinal Logistic Model

    xi: omodel logit inccat age i.ed employ debtinc

    brant, detail

    Test the parallel regression assumption (proportional odds assumption):

    omodel: to perform a likelihood-ratio test.

    brant: to perform the Brant test.

  • Ordinal Logistic Model

    prtab employ //predicted probabilities for each of the values of the variable specified

    prvalue, x(_Ied_2=1) //predicted probabilities for selected values of variables

    prvalue, x(_Ied_2=1 age=28 employ=3 debtinc=10)

  • Multinomial Logistic Model

    xi: mlogit inccat age i.ed employ debtinc

  • Multinomial Logistic Model

    listcoef

    fitstat

    prtab _Ied_2

  • Multinomial Logistic Model

    predict p1 p2 p3 p4 p5

    summarize p1 p2 p3 p4 p5

    sort employ

    twoway connect p1 p5 employ, msym(i i)

    [Figure: Pr(inccat==1) and Pr(inccat==5) plotted against employ]

  • Logistic Model

    Exercise 4

    1. Try to construct probit models.

  • End