Statistical Methods II


Page 1: Statistical Methods II

Statistical Methods II

Notes on SQL

Page 2: Statistical Methods II

SQL Programming

Employers increasingly tell us that they look for 2 things on a resume: SAS and SQL.

You have learned A LOT of SAS… but now let's focus our attention on SQL.

In these notes you will learn:

1. What SQL is
2. Why it is used
3. The basics of SQL syntax

And, we will go through a few REALLY fun and exciting examples.

Page 3: Statistical Methods II

SQL Programming

What is SQL?

SQL stands for “Structured Query Language”. It was designed as a language to manage data in relational database management systems (RDBMS).

The SQL language is sub-divided into several language elements, including:

• Queries, which retrieve the data based on specific criteria. This is the most important element of SQL.
• Clauses, which are constituent components of statements and queries.
• Expressions, which can produce either scalar values or tables consisting of columns and rows of data.
• Statements, which may have a persistent effect on schemas and data, or which may control transactions, program flow, connections, sessions, or diagnostics.
• SQL statements also include the semicolon statement terminator. Though not required on every platform, it is defined as a standard part of the SQL grammar.
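As a rough sketch of how these elements fit together, the query below labels each one in its comments. It uses sashelp.class, a sample table that ships with SAS; that table and the column name "ratio" are assumptions made for illustration and are not part of the course datasets.

```sas
proc sql;
   /* One statement: a query built from several clauses, ended by a semicolon */
   select name,                       /* SELECT clause: columns to return      */
          weight / height as ratio    /* expression producing a scalar value   */
      from sashelp.class              /* FROM clause: the source table         */
      where age >= 13                 /* WHERE clause: the retrieval criteria  */
      order by name;                  /* ORDER BY clause, then the terminator  */
quit;
```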

Page 4: Statistical Methods II

Why is PROC SQL better than Data steps?

• The syntax is transferable to other SQL software packages
• You can join up to 256 SAS tables
• No need to sort any of the input tables
• Simpler syntax than a normal SAS Data step

When is Proc SQL not better than Data steps?

• Uses more memory than regular data/procedure steps
• Could take longer than other procedures when working with very large contributing tables
• Logic flow becomes harder to implement
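To illustrate the "no need to sort" point, here is a minimal sketch of a join on two unsorted tables. The tables work.students and work.scores and their contents are invented for this example. A Data step MERGE with a BY statement would need both tables sorted (or indexed) by id first.

```sas
/* Hypothetical tables, invented for this illustration */
data work.students;
   input id name $;
   datalines;
3 Cara
1 Ann
2 Bob
;
run;

data work.scores;
   input id score;
   datalines;
2 88
3 75
1 92
;
run;

/* Neither table is sorted by id; PROC SQL joins them anyway */
proc sql;
   create table work.joined as
   select s.id, s.name, t.score
      from work.students as s
           inner join work.scores as t
           on s.id = t.id
      order by s.id;
quit;
```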

SQL Programming

Page 5: Statistical Methods II

SQL Programming

Why do we use SQL?

SQL is used primarily to:

• Retrieve data from and manipulate tables/datasets
• Add or modify data values in a table/dataset
• Add, modify, or drop columns in a table/dataset
• Create tables and views
• Join/merge multiple tables (whether or not they contain columns with the same name)
• Generate reports
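A few of these tasks can be sketched in one short Proc SQL step. This is only an illustration: it works on a copy of sashelp.class (a sample table shipped with SAS), and the names work.class, ratio and work.class_f are made up for the example.

```sas
proc sql;
   /* Create a table: work on a copy so the sample table is left untouched */
   create table work.class as
      select * from sashelp.class;

   /* Add a column */
   alter table work.class
      add ratio num format=6.2;

   /* Modify data values in the new column */
   update work.class
      set ratio = weight / height;

   /* Drop a column */
   alter table work.class
      drop age;

   /* Create a view: a stored query, not a second copy of the data */
   create view work.class_f as
      select name, height, weight, ratio
         from work.class
         where sex = 'F';
quit;
```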

Page 6: Statistical Methods II

SQL Programming

Why do we use SQL?

You probably noticed that the previous list includes a lot of things that we do with DATA steps in SAS. In many cases, SQL is a better alternative to a DATA step in SAS: it can be more efficient.

Clarification regarding SQL in SAS… We use SQL like DATA steps in SAS, NOT like (most) Procs. SQL is used to extract data, merge data and create variables… not to analyze data.

Let's take a look…

Page 7: Statistical Methods II

SQL Programming

Consider the Pennstate2 dataset. Let's say that you needed to:

• Only retain the sex, earpierces, tattoos, height, height choice, looks and friends variables.
• Sort by sex.
• Keep only observations with fewer than 4 earpierces.
• Create a new variable called HeightDiff, which is their Height Choice minus their current height.
• Create a new dataset called “Modeling” from the above requirements.

Page 8: Statistical Methods II

SQL Programming

My guess is that at this point, you would use a DATA step and your code would look something like this:

/* Heightdiff must be listed in keep=, otherwise the new variable is dropped */
Data Modeling (keep = sex earprces tattoo height htchoice looks friends Heightdiff);
set jlp.pennstate2;
where earprces < 4;
Heightdiff = Htchoice - Height;
run;

Proc sort data=modeling;
by Sex;
run;

This code would run and produce what you need.

Page 9: Statistical Methods II

SQL Programming

Here is what this same requirement would look like using Proc SQL:

proc sql;
create table work.modeling as
Select sex, earprces, tattoo, height, htchoice, looks, friends,
Htchoice-Height as HeightDiff
from jlp.pennstate2
where earprces < 4
order by sex;
quit;

What do you notice about this code that is unexpected in SAS?

Page 10: Statistical Methods II

SQL Programming

Let's pull this apart:

• proc sql;
<This is the Proc statement in SAS that calls SQL. Notice that there is no DATA= reference.>

• create table work.modeling as
<In SQL, order matters. If you want to retain the dataset (table) which is being created, rather than just view it, you must have this “create table” statement next. The syntax “create table library.file as” will create a dataset in the designated library with the designated filename. Please note that there is NO semicolon after this line.>

Page 11: Statistical Methods II

SQL Programming

• Select sex, earprces, tattoo, height, htchoice, looks, friends,
<This part of the statement functions like a “keep” statement. Note that you could use an “*” to simply include all variables. In SQL the variable names are separated by commas, not by spaces. Again, notice that there is no semicolon at the end of the select list.>

• Htchoice-Height as HeightDiff
<This part of the statement is the creation of a new variable, HeightDiff. Notice that the order is the expression on the old variables first, then the new variable name. This is different from what we normally do in SAS, which is the new variable first and then the old variables.>

• from jlp.pennstate2
<This part of the statement is the equivalent of the Set statement in a Data step. It references the source dataset, where everything comes from.>

Page 12: Statistical Methods II

• where earprces < 4
<This part of the statement looks just like what we would expect to see in a Data step.>

• order by sex;
<This part of the statement is the equivalent of embedding a Proc Sort in the Data step. Notice that since this is the end of the statement that began with “create table”, it is concluded with a semicolon.>

• quit;<Proc SQL ends with a “quit” rather than with a “run”>
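As a small sketch of two points from the walkthrough, the query below leaves out “create table” and uses “*” in place of a variable list. Assuming the jlp library is assigned as in the slides, the result is printed to the output window and no dataset is saved.

```sas
proc sql;
   /* No CREATE TABLE: the result is only displayed, not stored */
   select *                      /* "*" selects every variable */
      from jlp.pennstate2
      where earprces < 4
      order by sex;
quit;
```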

SQL Programming

Page 13: Statistical Methods II

SQL Programming

Let's look at another example… let's focus on categorizing a variable. Consider the UCDAVIS1 dataset.

• Create a new dataset called UCTEST.
• Only retain GPA, SEX and ALCOHOL.
• Create a new variable “GPACAT” which is a categorization of the GPA variable… where a GPA of 2 or below is low, above 2 and up to 3 is medium, and above 3 is high.

How would we do this without using SQL, and how would we do it using SQL?

Page 14: Statistical Methods II

Using a Data step, your code probably looks like this:

Data UCTEST (keep = GPA GPACAT SEX ALCOHOL);
set jlp.ucdavis1;
Format GPACAT $CHAR7.;
If GPA = . then GPACAT = " ";
else if GPA <= 2 then GPACAT = "LOW";
else if GPA <= 3 then GPACAT = "MEDIUM";
else GPACAT = "HIGH";
Run;

Proc print data=UCTEST;
Run;

SQL Programming

Why do we need this format statement?
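One way to explore this question is to run the same logic without the format statement. The sketch below uses a few made-up GPA values (work.demo is a hypothetical dataset). In a Data step, the length of a new character variable is fixed by the first value assigned to it in the code, so here GPACAT gets length 3 and the longer values are truncated.

```sas
data work.demo;
   input gpa;
   /* No FORMAT or LENGTH statement: gpacat takes length 3 from 'LOW' */
   if gpa <= 2 then gpacat = 'LOW';
   else if gpa <= 3 then gpacat = 'MEDIUM';   /* stored as 'MED' */
   else gpacat = 'HIGH';                      /* stored as 'HIG' */
   datalines;
1.5
2.7
3.6
;
run;

proc print data=work.demo;
run;
```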

Page 15: Statistical Methods II

Using SQL, your code probably looks like this:

PROC SQL;
CREATE TABLE work.UCTEST AS
SELECT GPA, Sex, alcohol,
CASE
WHEN GPA = . THEN ' '
WHEN GPA <= 2.0 THEN 'LOW'
WHEN GPA <= 3.0 THEN 'MEDIUM'
ELSE 'HIGH'
END AS GPACAT
FROM jlp.ucdavis1;
QUIT;

SQL Programming

What do you notice about this code that is different from the Data step?

Page 16: Statistical Methods II

SQL Programming

Let's look at another example... let's focus on creating a new quantitative variable using a mathematical operator.

Consider the UCDAVIS1 dataset again.

• Create a new dataset called UCTEST1.
• Create a new variable called “Leisure” which is the amount of TV time plus the amount of Computer time.
• Create a new variable that is 2x the sleep variable.
• Only retain those sitting in the front and the back.
• Sort the data by seat.

How would we do this without using SQL, and how would we do it using SQL?

Page 17: Statistical Methods II

SQL Programming

Using a Data step, your code probably looks like this:

Data UCTEST1 (keep = TV Computer Sleepx2 Seat Leisure);
set jlp.ucdavis1;
Leisure = (TV + Computer);
Sleepx2 = Sleep*2;
If seat = "Middle" then delete;
Run;

Proc sort data = UCTEST1;
by seat;
Run;

Page 18: Statistical Methods II

SQL Programming

Using SQL, your code probably looks like this:

PROC SQL;
CREATE TABLE work.UCTEST1 AS
SELECT TV, Computer, Seat,
(TV + Computer) AS Leisure,
Sleep*2 AS Sleepx2
FROM jlp.ucdavis1
WHERE SEAT IN ('Front', 'Back')
ORDER BY SEAT;
QUIT;

Page 19: Statistical Methods II

SQL Programming

*The general form of PROC SQL includes the following, in this order:

PROC SQL;

CREATE TABLE...AS <CREATES A NEW DATASET>

SELECT <LIST THE COLUMNS/VARIABLES TO BE INCLUDED IN THE ANALYSIS OR NEW DATASET>

CASE WHEN <PART OF THE SELECT LIST. CREATES A NEW VARIABLE FROM AN OLD VARIABLE...UNLIKE IN A DATA STEP, THE VARIABLE WILL ACCOMMODATE ALL VALUE LENGTHS - NOT JUST THE FIRST ONE>

END AS <MUST COMPLETE A CASE CLAUSE AND NAME THE NEW VARIABLE>

FROM <IDENTIFY THE LIBRARY.FILENAME TO BE USED AS THE SOURCE DATA>

WHERE <IDENTIFY ANY CONDITIONS HERE - LIKE ONLY OBSERVATIONS WITH A GPA > 3.0>

ORDER BY <CREATES A SORTING OF THE DATA>;

QUIT;

Page 20: Statistical Methods II

SQL Programming – Summary Statistics:

The table below can be found here: http://www.tau.ac.il/cc/pages/docs/sas8/proc/zsumfunc.htm

Proc SQL syntax     Description
AVG, MEAN           mean or average of values
COUNT, FREQ, N      number of nonmissing values
CSS                 corrected sum of squares
CV                  coefficient of variation (percent)
MAX                 largest value
MIN                 smallest value
NMISS               number of missing values
PRT                 probability of a greater absolute value of Student's t
RANGE               range of values
STD                 standard deviation
STDERR              standard error of the mean
SUM                 sum of values
SUMWGT              sum of the WEIGHT variable values
T                   Student's t value for testing the hypothesis that the population mean is zero
USS                 uncorrected sum of squares
VAR                 variance
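As a brief sketch of how these functions are used, the query below applies several of them to one column. It assumes the sashelp.class sample table that ships with SAS (not a course dataset); the output column names are made up. Because the select list contains only summary functions, the result is a single row.

```sas
proc sql;
   select count(height) as n_height,     /* number of nonmissing values */
          nmiss(height) as n_missing,    /* number of missing values    */
          mean(height)  as mean_height,  /* AVG would give the same     */
          std(height)   as sd_height,
          min(height)   as min_height,
          max(height)   as max_height,
          range(height) as range_height
      from sashelp.class;
quit;
```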

Page 21: Statistical Methods II

Any Questions?