Sas interview ques


What is the difference between a compiler and an interpreter? Give one example (software product) that acts as an interpreter.

Both serve the same purpose, turning source code into something the machine can run, but they differ in how they do it. An interpreter translates instructions one at a time and executes each one immediately. A compiler takes the complete source program and translates it into object code or machine language before anything runs. Compiled code generally runs more efficiently, because the result is a complete machine-language program that can then be executed directly. A Unix shell such as bash is a common example of an interpreter.

What is the difference between nodup and nodupkey options?

PROC SORT with NODUP removes a record only when it is identical, in every variable, to the record immediately before it in sorted order. PROC SORT with NODUPKEY keeps the first record for each combination of BY-variable values and removes every other record with the same sort key.

"nodup" is an alias for "noduprecs", which appears to mean "no duplicate records", but SAS cannot detect duplicate records unless they happen to land next to each other after sorting. When the BY variables do not determine that order, it is a matter of chance. Take a look at "nodup" at work. Note the record with the "extra" value of 3. It's still there after the "nodup" sort.

For example:

data test1;

input id1 $ id2 $ extra ;

cards;

aa ab 3

aa ab 1

aa ab 2

aa ab 3

;

proc sort nodup data=test1;

by id1 id2;

run;

options nocenter;

proc print data=test1;

run;

Obs id1 id2 extra

1 aa ab 3

2 aa ab 1

3 aa ab 2

4 aa ab 3

Now look again where the two records with an "extra" value of 3 are next to each other in the input dataset. This time it has been removed by "nodup".

data test2;

input id1 $ id2 $ extra ;

cards;

aa ab 3

aa ab 3

aa ab 2

aa ab 1

;

proc sort nodup data=test2;

by id1 id2;

run;

options nocenter;

proc print data=test2;

run;

Obs id1 id2 extra

1 aa ab 3

2 aa ab 2

3 aa ab 1

If you sort "nodupkey" then you will only be left with one record with that key combination in the above case as you can see below.

data test3;

input id1 $ id2 $ extra ;

cards;

aa ab 3

aa ab 3

aa ab 2

aa ab 1

;

proc sort nodupkey data=test3;

by id1 id2;

run;

options nocenter;

proc print data=test3;

run;

Obs id1 id2 extra

1 aa ab 3

It is a big mistake to think sorting "nodup" will always remove duplicate records. Sometimes it will, sometimes it won't. The reliable way to remove duplicate records is to "proc sort nodupkey" and include enough key variables to be sure you will lose the duplicates you want to lose. In the case shown above, if records with the same "extra" value are the duplicates we want to remove, then this variable should be included in the list of sort variables, and "nodupkey" will remove the duplicates as shown below.

data test4;

input id1 $ id2 $ extra ;

cards;

aa ab 3

aa ab 1

aa ab 2

aa ab 3

;

proc sort nodupkey data=test4;

by id1 id2 extra;

run;

options nocenter;

proc print data=test4;

run;

Obs id1 id2 extra

1 aa ab 1

2 aa ab 2

3 aa ab 3


What is _n_?

Automatic variables are created automatically by the DATA step or by DATA step statements. These variables are added to the program data vector but are not output to the data set being created. The values of automatic variables are retained from one iteration of the DATA step to the next, rather than set to missing.

Automatic variables that are created by specific statements are documented with those statements. For examples, see the BY statement, the MODIFY statement, and the WINDOW statement in SAS Language Reference: Dictionary. Two automatic variables are created by every DATA step: _N_ and _ERROR_. _N_ is initially set to 1. Each time the DATA step loops past the DATA statement, the variable _N_ increments by 1. The value of _N_ represents the number of times the DATA step has iterated.

_ERROR_ is 0 by default but is set to 1 whenever an error is encountered, such as an input data error, a conversion error, or a math error, as in division by 0 or a floating point overflow. You can use the value of this variable to help locate errors in data records and to print an error message to the SAS log.

For example, either of the two following statements writes to the SAS log, during each iteration of the DATA step, the contents of an input record in which an input error is encountered:

if _error_=1 then put _infile_;

if _error_ then put _infile_;
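As a small illustration of the iteration counter _N_ described above, the following sketch copies it into an ordinary variable and uses it to keep the first few rows. The names work.first_five and row_number are made up for this example; sashelp.class is a sample data set shipped with SAS.

```sas
data work.first_five;
  set sashelp.class;
  row_number = _n_;   /* _N_ itself is never written out, so copy it */
  if _n_ <= 5;        /* subsetting IF: keep only the first five iterations */
run;

proc print data=work.first_five;
run;
```

Because _N_ counts DATA step iterations rather than observations, it matches the row number here only because each iteration reads exactly one observation.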

What is the difference between proc means and proc summary?

Proc SUMMARY and Proc MEANS are essentially the same procedure. Both procedures compute descriptive statistics. The main difference concerns the default type of output they produce. Proc MEANS by default produces printed output in the LISTING window or other open destination whereas Proc SUMMARY does not. Inclusion of the print option on the Proc SUMMARY statement will output results to the output window.

The second difference between the two procedures appears when the VAR statement is omitted. When all variables in the data set are character, both procedures produce the same output: a simple count of observations. However, when some variables in the data set are numeric, Proc MEANS analyses all numeric variables not listed in any of the other statements and produces default statistics for these variables (N, Mean, Standard Deviation, Minimum and Maximum), whereas Proc SUMMARY still produces only a simple count of observations.

Using the SASHELP data set SHOES the example reflecting this difference is shown.

proc means data = sashelp.shoes;

run;

Inclusion of a VAR statement in both Proc MEANS and Proc SUMMARY, produces output that contains exactly the same default statistics.

Using the SASHELP data set SHOES the example reflecting this similarity is shown.

proc means data = sashelp.shoes;

class product;

var Returns;

run;
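For comparison, here is a sketch of the PROC SUMMARY counterpart of the PROC MEANS step above. The output data set name work.returns_by_product and the statistic names mean_returns and total_returns are assumptions chosen for illustration.

```sas
/* PROC SUMMARY prints nothing unless PRINT is specified */
proc summary data=sashelp.shoes print;
  class product;
  var Returns;
run;

/* More typical use: write the statistics to a data set instead */
proc summary data=sashelp.shoes nway;
  class product;
  var Returns;
  output out=work.returns_by_product mean=mean_returns sum=total_returns;
run;
```

NWAY restricts the output data set to the rows for each product, leaving out the overall total row.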

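How can you import a .csv file into SAS? Give the syntax.

SAS users and programmers are familiar with the traditional way in which SAS reads external files such as CSV or TXT files: a FILENAME statement together with an INFILE statement in a DATA step. PROC IMPORT is an alternative that reads the file without a hand-written INPUT statement.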
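A minimal sketch of the traditional FILENAME and INFILE approach mentioned above might look like this. The fileref rawcsv, the data set name work.shoes_raw, the variable names and their lengths are all assumptions, as is the presence of a header row (hence FIRSTOBS=2); the path is the one used in the PROC IMPORT program in this answer.

```sas
filename rawcsv "C:\temp\test.csv";   /* assumed location of the CSV file */

data work.shoes_raw;
  /* DSD treats commas as delimiters and strips the surrounding quotes */
  infile rawcsv dsd firstobs=2 truncover;
  input Region     :$25.
        Product    :$14.
        Subsidiary :$12.
        Stores
        Sales      :comma12.   /* COMMA informat removes $ and commas */
        Inventory  :comma12.
        Returns    :comma12.;
run;
```

Compared with PROC IMPORT, this takes more typing but gives full control over variable names, types and lengths.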

Syntax:

The IMPORT procedure statement arguments:

DATAFILE=

DBMS=

OUT=

REPLACE

Data source statements:

GETNAMES=

Other features: PRINT procedure

This example imports the following comma-delimited file and creates a temporary SAS data set named WORK.SHOES. GETNAMES= is set to 'no', so the variable names in record 1 are not used. DATAROW=2 begins reading data from record 2.

"Africa","Boot","Addis Ababa","12","$29,761","$191,821","$769"

Program

proc import datafile="C:\temp\test.csv"

out=shoes

dbms=csv

replace;

getnames=no;

datarow=2;

run;

proc print;

run;

How would you delete duplicate observations?

Before choosing a method, decide which kind of duplicate you are dealing with: observations that repeat the values of key variables (such as usubjid, treatment, or patientno) or exact duplicates (observations that match on every variable in the dataset).

If the observations are exact duplicates with respect to all the variables in the dataset, we can remove them as follows.

Using the NODUPRECS option in PROC SORT with a BY _ALL_ statement:

proc sort data=dsn noduprecs;

by _all_;

run;

NODUPRECS compares all the variables of each observation with the previous observation and deletes exact duplicates. Sorting BY _ALL_ guarantees that exact duplicates are adjacent.

PROC SQL approach:

proc sql noprint;

create table unique as select distinct * from dsn;

quit;

The asterisk tells SAS to identify distinct (unique) observations with respect to all variables in the dataset.

If the observations aren't exact duplicates but are duplicates with respect to some of the key variables in the dataset (e.g., usubjid, studyid, patientid, visit), then we can remove the duplicates by using a:

PROC SQL approach:

proc SQL noprint;

create table unique as select distinct usubjid from dsn;

quit;

By selecting only usubjid, we are asking SAS for one row per distinct usubjid value. Note that the resulting table contains only the usubjid column, not the other variables.

To keep one complete observation for each usubjid, use PROC SORT:

Proc sort data=dsn nodupkey;

by usubjid;

run;

NODUPKEY compares only the BY variables and deletes the observations that duplicate an earlier observation's key values.

PROC FREQ approach:

Proc freq data=dsn noprint;

tables usubjid/out=unique (keep=usubjid count where=(count=1));

run;

The NOPRINT option is required because we don't want the procedure to print the frequency table; we just want the output dataset. Because of WHERE=(count=1), that dataset contains only the usubjid values that occur exactly once.

Using the DATA step approach:

This code keeps only the observations whose usubjid occurs exactly once.

proc sort data=dsn out=temp;

by usubjid;

run;

data unique;

set temp;

by usubjid;

if first.usubjid and last.usubjid;

run;

With "if first.usubjid and last.usubjid", SAS keeps an observation only when it is both the first and the last in its usubjid BY group, that is, when the group contains exactly one observation. Any usubjid that has duplicates is left out of the output dataset (unique). The next step instead keeps the first observation of every usubjid:

data nodups;

set temp;

by usubjid;

if first.usubjid;

run;


Posted On: 09-08-14 at 01:31:20

vishnoiprem

Post Count: 806

What is the difference between an informat and a format? Name three informats or formats.

A format tells SAS how to write data, whereas an informat tells SAS how to read data. COMMAw.d, DOLLARw.d, MMDDYYw., DATEw., TIMEw. and PERCENTw. are informats (formats with the same names also exist). WORDDATE18. and WEEKDATEw. are formats.
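The distinction can be seen in a short sketch: informats appear where text is read, formats where values are displayed. The data set name work.fmt_demo, the variable names and the two data lines are invented for illustration.

```sas
data work.fmt_demo;
  /* informats: how to interpret the incoming text */
  input sale_date :mmddyy10. amount :comma10.;
  /* formats: how to display the stored numeric values */
  format sale_date worddate18. amount dollar12.2;
  datalines;
01/15/2024 1,234.50
02/03/2024 987
;
run;

proc print data=work.fmt_demo;
run;
```

Both variables are stored as plain numbers; the FORMAT statement only changes how they are printed.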

What is the difference between SET and MERGE?

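The SET statement reads observations from one or more data sets one after another (concatenating them or, with a BY statement, interleaving them), whereas the MERGE statement joins observations from two or more data sets side by side into single observations. MERGE is also often contrasted with UPDATE. The MERGE statement and the UPDATE statement both match observations from two SAS data sets; however, the two statements differ significantly. It is important to distinguish between the two processes and to choose the one that is appropriate for your application.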

The most straightforward differences are as follows: The UPDATE statement uses only two data sets. The number of data sets that the MERGE statement can use is limited only by machine-dependent factors such as memory and disk space.

A BY statement must accompany an UPDATE statement. The MERGE statement performs a one-to-one merge if no BY statement follows it. The two statements also process observations differently when a data set contains missing values or multiple observations in a BY group.
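The different handling of missing values can be sketched with two tiny data sets. All data set names, variables and values here are hypothetical.

```sas
data work.master;
  input id name $ salary;
  datalines;
1 Ann 50000
2 Bob 42000
;
run;

data work.trans;
  input id name $ salary;
  datalines;
1 Ann .
2 Bob 45000
;
run;

/* MERGE: values from the last data set win, including missing values */
data work.merged;
  merge work.master work.trans;
  by id;
run;

/* UPDATE: by default a missing value in the transaction data set
   leaves the master value unchanged */
data work.updated;
  update work.master work.trans;
  by id;
run;
```

For id 1, work.merged ends up with a missing salary, while work.updated keeps the master value of 50000.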

What is the difference between proc means and proc univariate?

PROC MEANS produces descriptive statistics (means, standard deviation, minimum, maximum, etc.) for numeric variables in a set of data. PROC MEANS can be used for:

1)Describing continuous data where the average has meaning

2)Describing the means across groups

3)Searching for possible outliers or incorrectly coded values

4)Performing a single sample t-test

The syntax of the PROC MEANS statement is:

PROC MEANS <options>; <statements>;

If the PROC MEANS procedure does not produce the statistic you need for a data set then PROC UNIVARIATE may be your choice. Although it is similar to PROC MEANS, its strength is in calculating a wider variety of statistics, specifically useful in examining the distribution of a variable.

Use PROC UNIVARIATE to examine the distribution of your data, including an assessment of normality and discovery of outliers.

The syntax of the PROC UNIVARIATE statement is:

PROC UNIVARIATE <options>; <statements>;

Commonly used options for PROC UNIVARIATE include:

DATA= - Specifies data set to use

NORMAL - Produces a test of normality

FREQ - Produces a frequency table

PLOT - Produces stem-and-leaf plot

Commonly used statements used with PROC UNIVARIATE include:

BY variable list;

VAR variable list;

OUTPUT OUT = datasetname;

The BY group specification causes UNIVARIATE to calculate statistics separately for groups of observations (i.e., treatment means). The OUTPUT OUT= statement allows you to output the means to a new data set. The following SAS program (PROCUNI1.SAS) produces a large number of statistics on the variable AGE:

DATA EXAMPLE;

INPUT TREATMENT LOSS @@;

DATALINES;

;

PROC UNIVARIATE NORMAL PLOT DATA=EXAMPLE; VAR AGE;

HISTOGRAM AGE/NORMAL (COLOR=RED W=5);

TITLE 'PROC UNIVARIATE EXAMPLE';

FOOTNOTE 'Evaluate distribution of variables';

RUN;

Do you prefer Proc Report or Proc Tabulate? Why?

It depends on the requirement.

Proc Report allows calculations involving input variables, so you can have a column of output based on sums, differences, ratios, or whatever else between the values of variables. Proc Tabulate cannot do this except for some very limited percentages. Proc Tabulate allows for multiple levels of nesting of categorical variables in both column and row dimensions, which Proc Report doesn't.

I use Proc Tabulate when we need to produce matrix-style reports. With Proc Report we can produce columnar-style reports.

Another important distinction is that PROC TABULATE only produces summary reports (one report row represents the statistics of a group of observations) while PROC REPORT can produce either detail (one report row = one observation) or summary reports. PROC REPORT can add extra summary lines using a LINE statement at certain break points on the report. And PROC REPORT can do traffic lighting based on more than one condition (using an IF statement and CALL DEFINE).
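The two strengths described above can be sketched with the SASHELP.SHOES sample data: a computed column in PROC REPORT and a matrix-style layout in PROC TABULATE. The computed column name net is an assumption made for this example.

```sas
/* PROC REPORT: a computed column derived from two summarized variables */
proc report data=sashelp.shoes nowd;
  column region sales returns net;
  define region  / group;
  define sales   / analysis sum;
  define returns / analysis sum;
  define net     / computed format=dollar14.;
  compute net;
    net = sales.sum - returns.sum;   /* difference of the two sums */
  endcomp;
run;

/* PROC TABULATE: regions down the rows, products across the columns */
proc tabulate data=sashelp.shoes;
  class region product;
  var sales;
  table region, product*sales*sum;
run;
```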

What is the difference between a PROC step and a DATA step?

PROC SQL

PROC SQL provides the combined functionality of the DATA step and several base SAS procedures.

Less complex and lengthy, but not as legible, code can be written in PROC SQL.

PROC SQL code may execute faster for smaller tables.

PROC SQL code is more portable for non-SAS programmers and non-SAS applications.

PROC SQL processing does not require explicit code to presort tables.

PROC SQL processing does not require common variable names to join on, although same type and length are required.

By default, a PROC SQL SELECT statement prints the resultant query; use the NOPRINT option to suppress this feature.

Knowledge of relational data theory opens the power of SQL for many additional tasks.

PROC SQL processing forces attention to resultant data set structures, as SQL is unforgiving of "errors of design".

Efficiencies within specific RDBMS are available with Pass-thru code for the performance of joins.

Use of aliases for shorthand code may make some coding tasks easier.

PROC SQL;

CREATE TABLE table1

( charvar1 CHAR(3)

, charvar2 CHAR(1)

, numvar1 NUM

, numvar2 NUM INFORMAT=DATE7.)

;

INSERT INTO table1

VALUES('me1','F',35786,'10oct50'd)

VALUES('me3','M',57963,'25jun49'd)

VALUES('fg6','M',25754,'17jun47'd)

VALUES('fg7','F',.,'17aug53'd)

;

SELECT *

FROM table1;

QUIT;

SAS programs are made up of distinct steps, and each one is completed before it moves on to the next one. Data steps are written by you. They are primarily used for data manipulation (hence the name) though in theory you could do some sorts of analysis with them. Proc steps are pre-written programs made available as part of SAS. The code may look similar to a data step in some ways, but the code in a proc step is not giving SAS step-by-step instructions to execute. All you are really doing is controlling how the proc step runs. We will use a few simple procs in the course of this article, but for more details see the SAS documentation.

A step starts with either the word data or the word proc, and ends with the word run;. The run; is often not strictly required, as SAS will assume you want to start a new step when it sees data or proc. However your code will be clearer and easier to understand if you make the end of each step explicit. That may not seem very important the first time you work on a particular program, but when you have to come back to it months later and figure out what you did, you'll quickly see that saving a few keystrokes is far less important than writing clear code. Obviously if you will be sharing this code with anyone else then making it easy to understand is even more important.

DATA table2;

SET table1(WHERE=(var1=value1));

RUN;

DATA table4;

SET table3;

IF var1=value1 AND

var2 IN (value-list);

RUN;

DATA table7;

MERGE table5 table6;

BY var1;

IF MOD(var4,3) NE 0 THEN DELETE;

RUN;

What is the purpose of the trailing @? The @@? How would you use them?

@ holds the input record for execution of the next INPUT statement within the same iteration of the DATA step (trailing @). @@ holds the input record for execution of the next INPUT statement across iterations of the DATA step (double trailing @).
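A common use of the single trailing @ is to look at part of a record before deciding how, or whether, to read the rest. The sketch below assumes a hypothetical record layout in which the first field is a type code; the data set name work.adults and the data lines are invented.

```sas
data work.adults;
  input type $ @;          /* read the first field and hold the record */
  if type = 'A' then do;
    input name $ age;      /* continue reading the same held record */
    output;
  end;
  datalines;
A John 34
C Tim 9
A Mary 41
;
run;
```

Records whose type is not 'A' are never read further, and the held record is released automatically when the DATA step starts its next iteration.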

Sometimes you may need to create multiple observations from a single record of raw data. One way to tell SAS how to read such a record is to use the other line-hold specifier, the double trailing at-sign (@@ or "double trailing @"). The double trailing @ not only prevents SAS from reading a new record into the input buffer when a new INPUT statement is encountered, but it also prevents the record from being released when the program returns to the top of the DATA step. (Remember that the trailing @ does not hold a record in the input buffer across iterations of the DATA step.) For example, this DATA step uses the double trailing @ in the INPUT statement:

data body_fat;

input Gender $ PercentFat @@;

datalines;

m 13.3 f 22

m 22 f 23.2

m 16 m 12

;

proc print data=body_fat;

title 'Results of Body Fat Testing';

run;

The following output shows the resulting data set:

Data Set Created with Double Trailing @

Results of Body Fat Testing

Obs Gender PercentFat

1 m 13.3

2 f 22.0

3 m 22.0

4 f 23.2

5 m 16.0

6 m 12.0

What is the difference between the SAS v8 and SAS v9?

SAS v8:

Maximum lengths in v8:

1)member names = 32 bytes

2)variable names = 32 bytes

3)character variable values = 32K

4)variable and member labels = 256 bytes

SAS 8 also added more output options to ODS in each successive release.

SAS v9:

1)It is a 64-bit application.

2)It includes BI tools (ETL, OLAP).

3)It can also be used in Unix environments.

How would you delete observations with duplicate keys?

Use the NODUPKEY option of PROC SORT: for each combination of BY-variable values it keeps the first observation and deletes the others.

What SAS statements would you code to read an external raw data file to a DATA step?

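The INFILE statement identifies the external file, and an INPUT statement then reads its records into variables. A FILENAME statement can optionally assign a fileref to the file first.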

What is the SAS/ACCESS and SAS/CONNECT?

SAS/CONNECT software is a SAS client/server toolset that establishes connections between networked computers with different operating systems and offers scalability through parallel SAS processing. By providing the ability to manage, access, and process data in a distributed and parallel environment, SAS/CONNECT enables users and applications developers to combine computing resources across varying architectures and SAS releases.

The following are the modes of connectivity:

1)Direct mode

2)Indirect mode

SAS/ACCESS is the component of SAS software that provides interfaces to multiple external data sources so that their data can be accessed from the SAS environment.

The sources are:

1)Interface to RDBMS

2)Interface to PC files, etc.

How many types of MERGE?

Types of merges in SAS:

1)One-to-one merge: a merge without any BY variables.

2)Match merge: a merge that uses BY variables. Under match merge we have one-to-one, one-to-many, and many-to-many, depending on the values in the BY variables. For a many-to-many merge, PROC SQL is more reliable.
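As a rough illustration of a one-to-many match merge, and of the PROC SQL alternative suggested for joins, consider the following sketch. The data sets work.patients and work.visits, their variables and their values are all hypothetical.

```sas
data work.patients;
  input usubjid $ age;
  datalines;
S01 34
S02 51
S03 47
;
run;

data work.visits;
  input usubjid $ visit;
  datalines;
S01 1
S01 2
S03 1
;
run;

/* One-to-many match merge: keep only subjects present in both data sets.
   Both inputs must already be sorted by the BY variable. */
data work.both;
  merge work.patients(in=inp) work.visits(in=inv);
  by usubjid;
  if inp and inv;
run;

/* The same result as an SQL inner join, which also copes with many-to-many */
proc sql;
  create table work.joined as
  select p.usubjid, p.age, v.visit
  from work.patients as p
       inner join work.visits as v
       on p.usubjid = v.usubjid;
quit;
```

The IN= variables (inp and inv) are temporary flags that record which data set contributed to the current observation.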

What is the difference between an informat and a format? Name three informats or formats?

Informat - the format used to read the variable from raw data. In the general form INFORMATw.d, INFORMAT refers to the sometimes optional SAS informat name. The w indicates the width (bytes or number of columns) of the variable. The d is used for numeric data to specify the number of digits to the right of the decimal place.

Format - the format used to print the values of the variable. The format is a SAS format or a user-defined format that was previously defined with the VALUE statement in PROC FORMAT. For more information about user-defined formats, see The FORMAT Procedure in Base SAS Procedures Guide. The w specifies the format width, which for most formats is the number of columns in the output data.

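How do you read binary data in SAS?

Binary data is numeric data that is stored in binary form. Binary numbers have a base of two and are represented with the digits 0 and 1. In a raw file, such fields are read with binary informats such as IBw.d, PIBw.d and RBw.d in the INPUT statement. PROC CIMPORT, shown below, is different: it restores a transport file created by PROC CPORT.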

proc cimport infile= data=;

run;