ABSTRACTING AND AUTOMATING HIERARCHICAL DATA MODELS: LEVERAGING THE SAS® FORMAT PROCEDURE
CNTLIN OPTION TO BUILD DYNAMIC FORMATS THAT
CLEAN, CONVERT, AND CATEGORIZE DATA
TROY MARTIN HUGHES
OCTOBER 2019
Copyright © 2019 Troy Martin Hughes
BIOGRAPHY
Troy has been a SAS practitioner for more than 20
years, has managed SAS projects in support of
federal, state, and local government initiatives, and
is a SAS Certified Base, Advanced, and Clinical
Trials Programmer. He has been a frequent
presenter at SGF, SAS Analytics Experience,
WUSS, SCSUG, MWSUG, SESUG, and
PharmaSUG. He has an MBA in Information
Systems Management and certifications including:
PMP, PMI-RMP, PMI-PBA, PMI-ACP, CISSP,
CSSLP, ITIL, CSM, CSD, CSPO, CSP-SM, and
CSP-PO. Troy is a consultant for the Department
of Defense (DoD) and is a US Navy veteran with
two Afghanistan deployments.
SAS FORMATS
Used to transform data
• Cleans data by identifying/removing extraneous values
• Converts/standardizes data when alternate forms exist (i.e., entity resolution)
• Categorizes (bins) data into groups or hierarchies
Weaknesses
• Often maintained within SAS code
• Some data models (e.g., hierarchies, taxonomies) require multiple formats
SAS FORMATS: DSM-5 AND ICD-10
Formats always begin with a data model:
proc format;
    value $dsm
        '291.0'='Alcohol dependence with intoxication delirium'
        '291.1'='Alcohol dependence with alcohol-induced persisting amnestic disorder'
        '291.2'='Alcohol dependence with alcohol-induced persisting dementia';
run;
DSM-5 Code  ICD-10 Code  Disorder Name
291.0       F10.221      Alcohol dependence with intoxication delirium
291.1       F10.26       Alcohol dependence with alcohol-induced persisting amnestic disorder
291.2       F10.27       Alcohol dependence with alcohol-induced persisting dementia
RAW DATA (NO FORMAT)
data codes;
length dsmcode $8;
label dsmcode='DSM-5 Code';
dsmcode='291.0'; output;
dsmcode='291.0'; output;
dsmcode='291.1'; output;
dsmcode='17'; output;
run;
proc print data=codes noobs label;
run;
DSM-5 Code
291.0
291.0
291.1
17
FORMATTED DATA
data codes;
length dsmcode $8;
label dsmcode='DSM-5 Code';
dsmcode='291.0'; output;
dsmcode='291.0'; output;
dsmcode='291.1'; output;
dsmcode='17'; output;
run;
proc print data=codes noobs label;
format dsmcode $dsm.;
run;
DSM-5 Code
Alcohol dependence with intoxication delirium
Alcohol dependence with intoxication delirium
Alcohol dependence with alcohol-induced persisting amnestic disorder
17
“17” does not appear in the data
model (SAS format) so its value
is not transformed
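Because unmatched values pass through a format unchanged, a second format can be used to flag them. The following is a minimal sketch rather than code from the slides: the format name $dsmchk, the data set Flagged, and the variable Status are hypothetical, and the Codes data set created above is assumed to exist.

```sas
/* Hypothetical validation format: known codes map to VALID, all else to INVALID */
proc format;
  value $dsmchk
    '291.0', '291.1', '291.2' = 'VALID'
    other                     = 'INVALID';
run;

/* Flag each observation; the value '17' falls through to OTHER */
data flagged;
  set codes;
  length status $7;
  status = put(dsmcode, $dsmchk.);
run;

proc print data=flagged noobs label;
run;
```

The OTHER keyword catches every value not listed explicitly, which is how a format can identify extraneous values rather than silently passing them through.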
PROBLEM 1: FORMATS DEFINED IN CODE
Issues
• Modularity is decreased because changes to the data model require (unnecessary)
changes to the underlying software
• Interoperability is decreased because SAS formats cannot be used for other non-SAS
purposes (without parsing the code)
• Master data management (MDM) is compromised because SAS and non-SAS versions
of formats are maintained
Solution
• Maintain dynamic formats external to software (e.g., in XML, Excel, text files, or other
interoperable file formats)
PROBLEM 2: CAN’T MODEL ONE-TO-ONE-TO-ONE
Issues
• A format can map one value to another value, or can bin many values into one
value, but cannot model one-to-one-to-one relationships (e.g., DSM-5 code to
ICD-10 code to diagnosis name)
Solution
• Maintain a single data model from which formats can be dynamically built (using
CNTLIN)
DSM-5 Code  ICD-10 Code  Diagnosis Name
291.0       F10.221      Alcohol dependence with intoxication delirium
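One way to picture the single-model approach is a control data set that feeds two formats to the CNTLIN option at once. This is an illustrative sketch under stated assumptions, not the author's macro: the data set names DSM_model and Cntlin, the variables icd10code and dsmname, and the format names dsm_to_icd and dsm_to_name are all hypothetical.

```sas
/* Hypothetical one-row-per-diagnosis model holding all three related values */
data dsm_model;
  length dsm5code $8 icd10code $8 dsmname $100;
  dsm5code='291.0'; icd10code='F10.221';
  dsmname='Alcohol dependence with intoxication delirium'; output;
  dsm5code='291.1'; icd10code='F10.26';
  dsmname='Alcohol dependence with alcohol-induced persisting amnestic disorder'; output;
  dsm5code='291.2'; icd10code='F10.27';
  dsmname='Alcohol dependence with alcohol-induced persisting dementia'; output;
run;

/* Build one control data set that describes two character formats */
data cntlin;
  set dsm_model;
  length fmtname $32 type $1 start $8 label $100;
  type = 'C';                                   /* C = character format */
  fmtname = 'dsm_to_icd';  start = dsm5code; label = icd10code; output;
  fmtname = 'dsm_to_name'; start = dsm5code; label = dsmname;   output;
  keep fmtname type start label;
run;

/* Rows belonging to each format must be contiguous in the control data set */
proc sort data=cntlin;
  by fmtname start;
run;

proc format cntlin=cntlin;
run;
```

With both formats built, put('291.0', $dsm_to_icd.) should return the ICD-10 code and put('291.0', $dsm_to_name.) the diagnosis name, so the one-to-one-to-one relationship lives in a single table rather than in two hand-maintained VALUE statements.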
PROBLEM 3: CAN’T MODEL HIERARCHICAL DATA
Issues
• Formats can bin data into categories, but can only bridge between two
hierarchical levels at one time
Solution
• Maintain a single data model from which formats can be dynamically built (using
CNTLIN)
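Classification 1                       Classification 2             Diagnosis Name                                          DSM-5 Code
Substance use and addictive disorders  Alcohol-related disorders    Alcohol dependence with intoxication delirium           291.0
Substance use and addictive disorders  Alcohol-related disorders    Alcohol dependence with alcohol-induced sleep disorder  291.82
Substance use and addictive disorders  Alcohol-related disorders    Unspecified alcohol-related disorder                    291.89
Substance use and addictive disorders  Substance-related disorders  Substance dependence with intoxication delirium         292.81
Substance use and addictive disorders  Substance-related disorders  Stimulant use disorder                                  304.40
Substance use and addictive disorders  Substance-related disorders  Other or unknown substance-related disorder             304.90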
7 POSSIBLE FORMATS WITH ONLY 3 DATA LEVELS
[Figure: the model variables CLASS1 (DSM-5 first-level classification), CLASS2 (DSM-5 second-level classification), DSM5name (DSM-5 diagnosis name), and DSM5code (DSM-5 code), connected by seven numbered arrows, one per possible format.]
• Conversion of the DSM-5 code to the DSM-5 diagnosis name (e.g., converting “291.0” into “Alcohol dependence
with intoxication delirium”).
• Categorization of the DSM-5 code into the DSM-5 level 2 classification (CLASS2) (e.g., categorizing “291.0” as
“Alcohol-related disorders”).
• Categorization of the DSM-5 code into the DSM-5 level 1 classification (CLASS1) (e.g., categorizing “291.0” as
“Substance use and addictive disorders”).
• Conversion of the DSM-5 diagnosis name to the DSM-5 code (e.g., converting “Alcohol dependence with
intoxication delirium” into “291.0”).
• Categorization of the DSM-5 diagnosis name into the DSM-5 level 2 classification (CLASS2) (e.g., categorizing
“Alcohol dependence with intoxication delirium” as “Alcohol-related disorders”).
• Categorization of the DSM-5 diagnosis name into the DSM-5 level 1 classification (CLASS1) (e.g., categorizing
“Alcohol dependence with intoxication delirium” as “Substance use and addictive disorders”).
• Categorization of the DSM-5 level 2 classification (CLASS2) into the DSM-5 level 1 classification (CLASS1) (e.g.,
categorizing “Alcohol-related disorders” into “Substance use and addictive disorders”).
BUILD_FORMAT MACRO TO THE RESCUE
Macro Definition
%macro build_format(fmtname= /* name of SAS format generated */,
    dsnmodel= /* data set in LIB.DSN or DSN format containing data model */,
    var1= /* variable (within the model) being transformed or categorized */,
    var2= /* variable (within the model) to which VAR1 is transformed */);
Sample Invocations
%build_format(fmtname=DSM5code_to_name,
dsnmodel=DSMmodel, var1=DSM5code,
var2=DSM5name);
%build_format(fmtname=DSM5code_to_class2_,
dsnmodel=DSMmodel, var1=DSM5code,
var2=class2);
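The slides show only the macro's signature, so the following is a hedged sketch of how a macro with the same four parameters might be written around CNTLIN. It is not the author's BUILD_FORMAT: the name build_format_sketch and the work data set _cntlin are hypothetical, and it assumes both VAR1 and VAR2 are character variables.

```sas
/* Sketch only: NOT the author's BUILD_FORMAT, just one way such a macro could work */
%macro build_format_sketch(fmtname=, dsnmodel=, var1=, var2=);
  /* Keep one row per distinct VAR1 value, since a format maps each value once */
  proc sort data=&dsnmodel(keep=&var1 &var2) out=_cntlin nodupkey;
    by &var1;
    where not missing(&var1);
  run;

  /* Rename the model variables to the START and LABEL names that CNTLIN expects */
  data _cntlin;
    set _cntlin(rename=(&var1=start &var2=label));
    retain fmtname "&fmtname" type 'C';   /* TYPE='C' requests a character format */
  run;

  /* Build the format from the control data set */
  proc format cntlin=_cntlin;
  run;
%mend build_format_sketch;

/* Assumes the DSMmodel data set has already been created */
%build_format_sketch(fmtname=DSM5code_to_class2_, dsnmodel=DSMmodel,
                     var1=DSM5code, var2=class2);
```

The trailing underscore in DSM5code_to_class2_ is deliberate: a SAS format name cannot end in a digit, so a name built from a variable such as class2 needs a non-numeric final character.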
[Figure: CLASS1, CLASS2, DSM5name, and DSM5code connected by seven numbered arrows, one per possible format.]
EXTERNAL XML DATA MAP (DSM5MODEL.MAP)
<?xml version="1.0" ?>
<SXLEMAP version="2.1">
<TABLE name="DSM5model">
<TABLE-PATH syntax="XPath">
/TABLE/CLASS1/CLASS2/DIAG
</TABLE-PATH>
<COLUMN name="CLASS1" retain="YES">
<PATH>/TABLE/CLASS1 </PATH>
<TYPE>character</TYPE>
<DATATYPE>string</DATATYPE>
<LENGTH>50</LENGTH>
</COLUMN>
<COLUMN name="CLASS2" retain="YES">
<PATH>/TABLE/CLASS1/CLASS2 </PATH>
<TYPE>character</TYPE>
<DATATYPE>string</DATATYPE>
<LENGTH>50</LENGTH>
</COLUMN>
<COLUMN name="DSM5name">
<PATH>/TABLE/CLASS1/CLASS2/DIAG/@DSM5name </PATH>
<TYPE>character</TYPE>
<DATATYPE>string</DATATYPE>
<LENGTH>100</LENGTH>
</COLUMN>
<COLUMN name="DSM5code">
<PATH>/TABLE/CLASS1/CLASS2/DIAG/@DSM5code </PATH>
<TYPE>character</TYPE>
<DATATYPE>string</DATATYPE>
<LENGTH>8</LENGTH>
</COLUMN>
</TABLE>
</SXLEMAP>
EXTERNAL XML DATA MODEL (DSM5MODEL.XML)
<?xml version="1.0" encoding="utf-8" ?>
<TABLE>
<CLASS1> Substance use and addictive disorders
<CLASS2> Alcohol-related disorders
<DIAG DSM5name="Alcohol dependence with intoxication delirium" DSM5code="291.0"/>
<DIAG DSM5name="Alcohol dependence with alcohol-induced sleep disorder" DSM5code="291.82"/>
<DIAG DSM5name="Unspecified alcohol-related disorder" DSM5code="291.89"/>
</CLASS2>
<CLASS2> Substance-related disorders
<DIAG DSM5name="Substance dependence with intoxication delirium" DSM5code="292.81"/>
<DIAG DSM5name="Stimulant use disorder" DSM5code="304.40"/>
<DIAG DSM5name="Other or unknown substance-related disorder" DSM5code="304.90"/>
</CLASS2>
</CLASS1>
</TABLE>
BUILD_FORMAT IN ACTION
Ingest the XML data map and data model:
* change this location to the location of the XML map file and XML model file;
filename DSM5in '/folders/myfolders/DSM5model.xml';
filename DSMmap '/folders/myfolders/DSM5model.map';
libname DSM5in xmlv2 xmlmap=DSMmap;
data DSMmodel;
set DSM5in.DSM5model;
run;
Sample Invocation
%build_format(fmtname=DSM5code_to_name,
dsnmodel=DSMmodel, var1=DSM5code, var2=DSM5name);
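Once the format exists, it can be applied like any hand-coded format. The short sketch below assumes that the invocation above has created $DSM5code_to_name and that the Codes data set from the earlier slides is available; the data set Codes_named and the variable dsmname are hypothetical names.

```sas
/* Assumes %build_format has created $DSM5code_to_name and that CODES exists */
data codes_named;
  set codes;
  length dsmname $100;   /* hypothetical variable; 100 matches DSM5name in the map */
  dsmname = put(dsmcode, $DSM5code_to_name.);
run;

proc print data=codes_named noobs label;
run;
```

Any code absent from the XML model would be expected to pass through unchanged, as with the hand-coded format, unless the generated format also defines an OTHER category.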
BONUS – GAME OF THRONES EDITION
The Game of Thrones world contains characters who hail from various regions, and
this data model (spreadsheet) can be exported to CSV and imported into SAS.
SAS® Data-Driven Development, page 339
BONUS – GAME OF THRONES EDITION
A data set (Favorites) might list fans' favorite characters and seasons, as
text-mined from a user forum:
Jaime Lannister,2
Cersei Lannister,7
Daenerys,4
Daneris,7
Daenerys Targaryen,2
Jaime,1
Kit Harington,2
Aria Stark,7
Tyrion Lannister,3
Sansa Stark,5
SAS® Data-Driven Development, page 281
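As a sketch of how these comma-delimited records could be loaded, the DATA step below reads them in-stream. The variable favcharacter matches the name used in the later PUT calls; favseason is a hypothetical name for the second column, and the length of 50 is an assumption.

```sas
/* Read the comma-delimited forum records into the Favorites data set */
data favorites;
  length favcharacter $50 favseason 8;   /* favseason is a hypothetical name */
  infile datalines dsd;                  /* DSD treats the comma as the delimiter */
  input favcharacter favseason;
  datalines;
Jaime Lannister,2
Cersei Lannister,7
Daenerys,4
Daneris,7
Daenerys Targaryen,2
Jaime,1
Kit Harington,2
Aria Stark,7
Tyrion Lannister,3
Sansa Stark,5
;
run;
```

Because the DSD option makes the comma the only delimiter, names containing spaces are read as a single value.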
BONUS – GAME OF THRONES EDITION
The BUILD_FORMAT macro can be used to perform entity resolution:
%build_format(fmtname=variation_to_character,
dsnmodel=GOT_model_tabular,
var1=variation,
var2=character);
data validate;
length newcharacter $50;
set favorites;
newcharacter=put(favcharacter,
$variation_to_character.);
run;
BONUS – GAME OF THRONES EDITION
BUILD_FORMAT can be used a second time to categorize characters into regions:
%build_format(fmtname=character_to_region,
dsnmodel=GOT_model_tabular,
var1=character,
var2=region);
data validate;
length region $50 newcharacter $50;
set favorites;
newcharacter=put(favcharacter,
$variation_to_character.);
region=put(newcharacter,
$character_to_region.);
run;
BONUS – GAME OF THRONES EDITION
With two calls to BUILD_FORMAT, the unruly data are cleaned and categorized:
[Table not reproduced: the Validate data set, showing each favcharacter value from Favorites alongside the derived newcharacter and region values.]
SAS® Data-Driven Development, page 294
CONCLUSION
• Complex, hierarchical, and dynamic data models should be maintained external
to code to facilitate data independence.
• Through data-driven software design, external data models support software
modularity, interoperability (beyond SAS), and master data management.
• This text introduces the BUILD_FORMAT macro that dynamically builds
complex and hierarchical SAS formats from XML files, Excel spreadsheets, and
other canonical formats.