
Page 1: Chapter 2: Preparing Data Using ... - Open Online Coursesopenonlinecourses.com/spc/2 Preparing Data.docx  · Web viewThis book could also be used without knowing SQL commands, ...

Chapter 2: Preparing Data Using Structured Query Language (SQL)

Revised Sunday, February 11, 2018, by Farrokh Alemi; version of Tuesday, May 30, 2017; edited by Nancy Freeborne, PhD, on February 3, 2017.

Introduction

Data in electronic health records reside in a large number of tables, linked to each other through primary keys. Before any analysis can be done, the data must be prepared in a matrix format so that all relevant variables are present in the same table. Many statistical books do not show how this can be done and thus leave the analyst at a disadvantage in handling data from electronic health records; they do not teach the use of Structured Query Language (SQL). This book can be used without knowing SQL commands: the reader can skip the sections that describe the application of SQL. We do not recommend it, but it can be done. We take a different approach from most statistical books and believe that SQL and data preparation are important components of data analysis. An analyst who wants to handle data in electronic health records needs to know SQL, and this book teaches SQL because it matters in the statistical analysis of the data.

Accurate statistical analysis requires careful data preparation. Statisticians who learn statistics without a deep understanding of data preparation may remain confused about their data. It is akin to living your life not knowing your parents, where you came from, or, for that matter, who you are. You can live your life in a fog, but why do so? Knowing the source of the data and its unique features can help the analyst gain insight into anomalies in the data. Manipulating the data, preparing it for analysis, and looking at the data in different ways with different definitions give the analyst intuitions and insights into the data.

SQL is central to what statisticians do with most of their time. Statisticians spend more time preparing data than actually conducting analysis; perhaps 80% of a data analyst's time is spent preparing data. That is a large portion of a statistician's day-to-day tasks, and ignoring it would significantly handicap the statistician. Knowing SQL helps with the bulk of what statistical analysts do, so training in it is essential.

Of course, we do not need the statistician to become a computer programmer; this, after all, is a course on statistics and not computer programming. Thankfully, SQL programming is relatively easy (there are few commands) and can be picked up quickly. This chapter exposes the reader to the most important SQL commands: SELECT, GROUP BY, WHERE, JOIN, and some key functions. These commands are for the most part sufficient to do everything in this book, and they allow the analyst to carry out many different transformations of the data. While there are few SQL commands, the complexity of the data makes their use difficult. A great deal of practice, and many checks and balances within the code, are necessary to make sure that the commands are appropriately used and the transformed data are meaningful.

Decisions made in preparing the data can radically change the findings. These decisions need to be made carefully and transparently. The statistical findings depend on them, and the analyst must make every attempt to communicate the details of the preparation to the manager. Managers and policymakers must ensure that decisions made in preparing the data are well thought out, or they will be viewing erroneous findings and possibly making incorrect major strategic decisions. Some common errors in preparing data include the following:


1. Visits and encounters reported for deceased patients. For example, when a patient's date of visit or date of death is mis-entered, it may look as if dead patients (zombies) are visiting the provider. Errors in the entry of dates of events would skew results, so cleaning up these errors is crucial.

2. Inconsistent data (e.g., pregnant males) must be identified, and steps must be taken to resolve the inconsistencies.

3. Incongruous data, such as a short hospital stay for a patient with a medication error, must be reviewed. When a medication error occurs, one would expect these patients to have longer hospital stays. If that is not the case, one should review the details of the case to see why not.

4. Steps should be taken to resolve missing information. Sometimes missing information can be replaced with the most likely response; other times missing information can itself be used as a predictor. For example, if a diagnosis is not reported in the record, then it is likely that the patient did not have it. At the same time, the reverse could be true: if an emergency room patient is missing key diagnoses, it may be that there was no time to diagnose the patient even though the patient had the conditions. Several studies have found that missing diagnoses in emergency rooms increase the risk of subsequent mortality, suggesting that in these situations a missing diagnosis does not mean that the patient did not have the condition. Before proceeding with the analysis, missing values must be imputed. There are many different strategies for dealing with missing values, and the rationale for each imputation should be examined.

5. Information is double counted because of errors in joining tables. A common error occurs when, in the process of joining tables, analysts use a non-unique primary key and duplicate rows of data.

In short, a great deal must be done before any data analysis commences.

Linking Data in Multiple Files

Data in electronic health records are often simultaneously present in multiple files. Patient information is in one table; prescription data are in another. Data on diagnoses are often in an outpatient visit table; hospital data are in still another table. An important first step in any data analysis is to pull the various variables of interest into the same table. These efforts lead to a large, often sparse, table in which all the variables are present but many have missing values. The reason for the missing values is that, for instance, Patient X could have diagnosis and prescription data but no hospital data if he or she was never hospitalized. Patient Y could have diagnosis, prescription, and hospital data but be missing other data (for instance, surgical procedures, if he or she never had surgery). Pulling the data together requires the use of Structured Query Language (SQL). A full introduction to SQL is beyond the scope of this book, but we discuss the SQL commands that are often used in merging data so that the reader becomes familiar with the concepts. Throughout this book, and especially in this chapter, we give brief SQL commands for preparing the data so that the reader can become familiar with the key terms used in SQL.

There are different implementations of SQL. In this chapter, we use Microsoft Access SQL, as Access is widely available to managers. Other implementations, such as Microsoft SQL Server, are also available. If the reader is familiar with the concept of


the code laid out here, he or she can also find the equivalent code in a different dialect on the web. To assist the reader, we also provide code using Microsoft SQL Server.

In EHRs, data reside in tables. A table is a collection of fields, and fields are variables that provide information. One of the fields in the table is the primary key, referred to as the ID for the table. A primary key is a unique number for each row of data in the table, and all of the fields in the table provide information about this primary key. For example, we may have a table about the patient, which would include demographics and contact information, and a separate table about the visit. The primary key for the patient table is a patient identifier such as the medical record number; the primary key for the visit table is a visit ID. The fields in the patient table (e.g., address) are all about the patient, and the fields in the visit table (e.g., diagnoses) are all about the visit. The relationships among the tables are indicated by repeating the primary key of one table in another table, where the repeated key is referred to as a foreign key. For example, in the visit table we indicate the patient by providing the patient ID. Database designers do not provide any other information about the patient (e.g., his or her address) in the visit table, so as to have efficient databases. In other words, databases use as little information as they can, to preserve space and to reduce processing time.

As an example, assume that you need to prepare a database that contains three entities: patients, providers, and encounters. For each of these three entities, we need to create a separate table. Each table will describe the attributes of one of the three entities, and each attribute will be a separate field. For simplicity, assume that the patient attributes are first name, last name, date of birth, address (street name, street number, city, state, and zip code), and email. First name is a string with a maximum size of 20 characters; last name is a string with a maximum size of 50 characters. Street number and zip code are integers with no decimals. Date of birth is a date displayed in the format DD MMM YY (e.g., 19 Jan 04). The possible values for the state are Maryland, Virginia, District of Columbia, and other. The patient's telephone number could be stored as text (especially if it includes parentheses and other non-numeric characters) or as a number. A patient ID (auto-number) is used as the primary key for the table. Table 1 shows the first three rows of data in our hypothetical patient table. Note that two patients are shown to live in the same household and have the same last name.

Table 1: Three Rows of Data for Example Patient Table

First Name | Last Name | Zip Code | City   | State    | Date of Birth | Email           | Telephone
Larry      | Kim       | 22101    | McLean | DC       | 08-Jan-54     | [email protected] | 703-9934226
George     | Smith     | 22102    | McLean | Virginia | 09-Sep-60     | [email protected] | (703) 8884545
Jill       | Smith     | 22102    | McLean | Virginia | 01-Aug-89     | [email protected] | 703 993 4226

The provider attributes are assumed to be first name (text of maximum size 20), last name (text of maximum size 50), whether the provider is board certified (a yes/no value), date of hire (displayed in the format DD MMM YY), telephone (text or number), and email (text of size 75). The employee ID number should be the primary key for the table. Table 2 shows the first three rows of data for providers; note that one of the patients is also a provider.


Table 2:  Three Rows of Data for Example Providers Table

First Name | Last Name | Board Certified | Email           | Telephone  | Employee ID
Jim        | Jones     | Yes             | [email protected] | 3456714545 | 452310
Jill       | Smith     | No              | [email protected] | 3454561234 | 454545
George     | John      | Yes             | [email protected] | 3104561234 | 456734

The encounter entity is assumed to have the following attributes: patient ID, provider ID, diagnosis (text of maximum size 50), treatment (text of size 50), and date of encounter (entered in the format DD MMM YY). Each encounter should have its own ID number. Table 3 shows the first five rows of the encounter table.

Table 3: Five Records for Example Encounters Table

ID | Patient ID | Provider ID | Date of Encounter | Diagnosis     | Treatment
1  | 1          | 452310      | 10-Jan-04         | Hypertension  | Assessment
2  | 1          | 452310      | 17-Jan-04         | Heart Failure | Monitoring
3  | 2          | 452310      | 10-Jan-04         | Null          | Assessment
4  | 3          | 452310      | 10-Jan-04         | Hypertension  | Assessment
5  | 1          | 454545      | 10-Jan-04         | Asthma        | Education

Figure 1 shows the relationship between patient, encounter, and provider entities in our hypothetical electronic medical record. In the encounter table, we have two foreign keys: patient ID and provider ID. These foreign keys link the encounter table to the patient and provider tables.

Figure 1: Example of Relationship among Three Hypothetical Tables
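As a concrete illustration of tables, primary keys, and foreign keys, the three entities above can be sketched as table definitions. The sketch below uses SQLite through Python's sqlite3 module; the column names mirror Tables 1-3, but the exact types and auto-number syntax would differ in Microsoft Access or SQL Server.

```python
# A sketch, not production DDL: SQLite via Python's sqlite3; the table and
# column names mirror Tables 1-3 but are otherwise hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One table per entity; each row is identified by its primary key.
cur.execute("""
    CREATE TABLE Patients (
        PatientID   INTEGER PRIMARY KEY,   -- auto-number primary key
        FirstName   TEXT,
        LastName    TEXT,
        DateOfBirth TEXT,
        ZipCode     INTEGER,
        City        TEXT,
        State       TEXT,
        Email       TEXT,
        Telephone   TEXT
    )""")
cur.execute("""
    CREATE TABLE Providers (
        EmployeeID     INTEGER PRIMARY KEY,
        FirstName      TEXT,
        LastName       TEXT,
        BoardCertified TEXT,
        Email          TEXT,
        Telephone      TEXT
    )""")
# The encounter table repeats the other tables' primary keys as foreign keys.
cur.execute("""
    CREATE TABLE Encounters (
        EncounterID     INTEGER PRIMARY KEY,
        PatientID       INTEGER REFERENCES Patients (PatientID),
        ProviderID      INTEGER REFERENCES Providers (EmployeeID),
        DateOfEncounter TEXT,
        Diagnosis       TEXT,
        Treatment       TEXT
    )""")

cur.execute("INSERT INTO Patients (PatientID, FirstName, LastName) VALUES (1, 'Larry', 'Kim')")
cur.execute("INSERT INTO Providers (EmployeeID, FirstName, LastName) VALUES (452310, 'Jim', 'Jones')")
cur.execute("""
    INSERT INTO Encounters (EncounterID, PatientID, ProviderID, DateOfEncounter, Diagnosis, Treatment)
    VALUES (1, 1, 452310, '2004-01-10', 'Hypertension', 'Assessment')""")

tables = [r[0] for r in cur.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
```

The REFERENCES clauses record that PatientID and ProviderID in the encounter table are foreign keys pointing at the primary keys of the other two tables.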


Sometimes, when the relationships are complex, a junction table is created to keep track of them. A junction table is a table about relationships. For example, a patient may have multiple street addresses, and it does not make sense to provide more than one address in the patient table. Instead, a separate table is made for addresses, and a foreign key in the patient table points to the address table. A junction table may be inserted between the patient table and the address table to clarify the type of each address.

Making a Query

Databases contain a great deal of information in separate tables. An analyst is often called upon to integrate these separate tables so as to find answers to specific questions. This process is called querying a database. SQL is the set of commands that can be used to query a database. An example is provided below:

SELECT Claims.PatientID, LAST(ICD9.ICDDescription) AS LastOfICDDescription
FROM Claims INNER JOIN ICD9 ON Claims.DiagnosisCode = ICD9.ICD9Codes
WHERE ICD9.ICDDescription LIKE "%diabete%"
GROUP BY Claims.PatientID
ORDER BY Claims.PatientID;

In the above SQL code, all commands are in capital letters. The SELECT command identifies the variables that should be in the new, integrated table. In specifying a field, you need to indicate the source table as well; thus "Claims.PatientID" indicates that "PatientID" comes from the "Claims" table. Fields can be included from any of the joined tables. In this fashion, a select query makes information in various tables available in one new table. In addition, a new variable can be calculated from more than one existing variable.

The FROM command specifies which tables should be used. If the data are in more than one table, then the tables must be joined. The INNER JOIN sub-command of FROM specifies the keys that should be matched in joining two tables. There are at least four different types of join, with different consequences.

1. One-to-One Join: This is the most common join. The fields in both tables must match exactly before the content of the tables is joined. For example, one might have a Claims table that contains diagnosis codes. The meaning of these diagnostic codes might be available in a separate table called [Diagnosis Codes]. A join can select the text for the diagnostic code and combine it with the data in the Claims table. A one-to-one join will lead to a listing of all claims in which the diagnostic code has corresponding text in the diagnosis table. In a one-to-one join, if the description of the diagnosis code is missing in the [Diagnosis Codes] table, then the corresponding claims will not show. It is important to be careful that, as a consequence of a one-to-one join, important data are not missed. For example, suppose that we have the two tables described in Table 4, one containing descriptions of diagnosis codes and the other containing encounters that refer to diagnoses. The description table includes text describing the nature of each diagnosis. The encounter table includes no text, just IDs that can be used to connect to the description table. We can make a one-to-one join between these two tables by joining "Diagnosis ID" in the encounter table with "ID of Code" in the description table. Joining these two tables will allow us to see a description for each diagnosis, except diagnosis 6, which is not in our description table. The last row of the encounter table will be dropped


because there is no "Diagnosis ID" 6 in the description table. In one-to-one joins, care should be taken that important information is not lost because of mismatches in IDs.

Table 4: Encounter and Description Tables

Description of Diagnosis Codes

ID of Code | Diagnosis Code | Description                                       | Hospital ID
1          | 410.05         | Acute myocardial infarction of anterolateral wall | 1
2          | 250.00         | Diabetes mellitus without mention of complication | 1
3          | 250.01         |                                                   | 1
4          | 410.05         | Acute MI of anterolateral wall                    | 2
5          | 250.00         | Diabetes mellitus w/out mention of complication   | 2

Encounters

Patient ID | Provider ID | Diagnosis ID | Hospital ID
1001       | 12          | 1            | 1
123        | 240         | 5            | 2
150        | 2555        | 6            | 1

2. Left and Right Join: These two joins allow the rows of one table to always be included, and the rows of the other table to be included only when the keys match. When the two keys do not match, the record is still kept, but a null value appears in place of the missing match. Following the previous example, in a left join we can display all encounters and their corresponding diagnosis text. If the diagnosis text is missing in the description table, a null value is entered: the encounter data are still there, but the description of the diagnosis will be null. Thus, using the information in Table 4, the last row in the encounter table will remain in the analysis, but all fields from the description table will be null for this row of data.

For example, a left join of A, B, and D to B, C, and D will result in (A, null), (B, B), and (D, D). A right join will result in (B, B), (null, C), and (D, D).

3. Full Join: A full join is the union of the left and right joins. In the example above, the full join results in (A, null), (B, B), (null, C), and (D, D). Full joins are helpful when a complete set of primary keys from both tables is needed in later steps of the analysis.

4. No Join (Cross Join): The last type of join occurs when two tables are present in a query but not joined through any field. In this circumstance, every record in one table is coupled with every record in the other table. For example, if one table has two records, A and B, and another table has records 1, 2, 3, and 4, then the consequence of having the two tables in a query without a join is 8 records, all possible combinations of the two tables: A1, A2, A3, A4, B1, B2, B3, and B4. Cross joins are problematic, as they multiply the size of the data (the row counts of the tables multiply).
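The behavior of the four join types can be verified on the two small key lists (A, B, D and B, C, D) used above. The sketch below uses SQLite via Python's sqlite3; older SQLite versions lack RIGHT and FULL JOIN, so the right join is emulated by swapping the table order of a left join, and a full join would be the union of the left and right results.

```python
# A sketch of the four join behaviors using SQLite via Python's sqlite3.
# The two single-column tables hold the keys A, B, D and B, C, D from the text.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t1 (k TEXT)")
cur.execute("CREATE TABLE t2 (k TEXT)")
cur.executemany("INSERT INTO t1 VALUES (?)", [("A",), ("B",), ("D",)])
cur.executemany("INSERT INTO t2 VALUES (?)", [("B",), ("C",), ("D",)])

# Inner ("one-to-one") join: only keys present in BOTH tables survive.
inner = cur.execute(
    "SELECT t1.k, t2.k FROM t1 INNER JOIN t2 ON t1.k = t2.k ORDER BY t1.k").fetchall()

# Left join: every t1 row is kept; an unmatched row gets NULL on the t2 side.
left = cur.execute(
    "SELECT t1.k, t2.k FROM t1 LEFT JOIN t2 ON t1.k = t2.k ORDER BY t1.k").fetchall()

# Right join, emulated by swapping the table order of a left join
# (older SQLite versions have no RIGHT JOIN keyword).
right = cur.execute(
    "SELECT t1.k, t2.k FROM t2 LEFT JOIN t1 ON t1.k = t2.k ORDER BY t2.k").fetchall()

# Cross join (tables listed with no join condition): every row of one table
# is paired with every row of the other, 3 x 3 = 9 rows here.
cross = cur.execute("SELECT t1.k, t2.k FROM t1, t2").fetchall()
```

Inspecting the results shows the dropped keys, the null placeholders, and the multiplicative growth of the cross join.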


Expressions and Functions

A number of functions are available in SQL that allow one to calculate the value of a new variable. These include arithmetic operations such as add and divide; text operations such as concatenate; date operations such as days since; and logical operations such as maximum and if. In the SELECT command you can also use the AS statement to rename the new field. SQL usually does not allow spaces in a name; if there is a space in a name, one must put the name inside brackets. Here is an example of an expression that computes a new field called "Diagnosis" using the IIF function:

IIF(ICD9!Description LIKE '%diabetes%', 'Diabetes', 'Other') AS Diagnosis

This expression tells us that if the field "Description" in the table "ICD9" contains the word "diabetes," the system will assign the field "Diagnosis" the value "Diabetes"; otherwise it will assign the value "Other." Here is another expression:

Date() - Claims.Date AS DaysTillNow

This expression sets the field "DaysTillNow" equal to the number of days between today and the field called "Date" in the "Claims" table. The expression "Date()" is a built-in Microsoft Access function that provides today's date; the equivalent function may be different in other SQL dialects. The exact format and meaning of various SQL functions are available by doing a keyword search on the Internet.
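The two expressions above are Access-specific, but other dialects have equivalents. A hedged sketch in SQLite (via Python's sqlite3) uses CASE WHEN in place of IIF and julianday() arithmetic in place of Date() subtraction; the tables and values are hypothetical.

```python
# A sketch of the two expressions in SQLite via Python's sqlite3. CASE WHEN
# stands in for Access's IIF, and julianday() arithmetic stands in for
# Date() subtraction; the tables and values are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE ICD9 (Code TEXT, Description TEXT)")
cur.executemany("INSERT INTO ICD9 VALUES (?, ?)",
                [("250.00", "Diabetes mellitus without complication"),
                 ("410.05", "Acute myocardial infarction")])

# Equivalent of: IIF(Description LIKE '%diabetes%', 'Diabetes', 'Other') AS Diagnosis
rows = cur.execute("""
    SELECT Code,
           CASE WHEN Description LIKE '%diabetes%'
                THEN 'Diabetes' ELSE 'Other' END AS Diagnosis
    FROM ICD9
    ORDER BY Code""").fetchall()

# Equivalent of: Date() - Claims.Date AS DaysTillNow
cur.execute("CREATE TABLE Claims (ClaimDate TEXT)")
cur.execute("INSERT INTO Claims VALUES (date('now', '-30 day'))")
days = cur.execute("""
    SELECT CAST(julianday('now') - julianday(ClaimDate) AS INTEGER) AS DaysTillNow
    FROM Claims""").fetchone()[0]
```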

The WHERE command allows one to filter the data and select only a specific subset. A WHERE command uses one or more criteria, and the records or rows in a table are reduced to the rows that meet the criteria. For example, we might have a table of claims and want to restrict it to patients who had a claim of influenza. Here are some more examples of criteria:

1. The criterion >120 means that the field must be greater than 120.
2. The criterion >#6/12/05# means the field entry must fall after June 12, 2005.
3. The criterion Not "*ism" means that the field cannot end with the suffix "ism."
4. The criterion <Date() means before today's date.
5. The criterion Is Null means that there are no data in the field.
6. The criterion Not "Diabetes" means not matching the word "diabetes" exactly; "diabetic" would be allowed, as it does not match "diabetes."
7. The criterion Like "dia*" matches any text that starts with "dia," such as diabetes, dialog, or diagram.
8. The criterion Between "A" And "D" matches any text starting with A, B, or C.

The LIKE criterion selects rows of data that contain a specific text. For example, Like "*diabet*" tells the software to select rows that contain the referenced text anywhere in the field; it will select "patient had diabetes" as well as "Diabetic patient." The stars in the reference text indicate where wildcard characters can be. (In Access, the wildcard is *; in most other dialects, including SQL Server, it is %.)
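Several of these criteria can be tried directly. A sketch in SQLite via Python's sqlite3, with hypothetical note text; SQLite uses the % wildcard rather than Access's *, and the pattern %diabet% is chosen so that both "diabetes" and "Diabetic" match.

```python
# A sketch of LIKE filtering in SQLite via Python's sqlite3. SQLite's wildcard
# is % rather than Access's *; the note text is hypothetical, and the pattern
# %diabet% is chosen so that both "diabetes" and "Diabetic" match.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Notes (txt TEXT)")
cur.executemany("INSERT INTO Notes VALUES (?)",
                [("patient had diabetes",), ("Diabetic patient",),
                 ("dialysis scheduled",), ("hypertension",)])

# %diabet% matches the text anywhere in the field (case-insensitively in SQLite).
diabet = [r[0] for r in cur.execute(
    "SELECT txt FROM Notes WHERE txt LIKE '%diabet%' ORDER BY txt")]

# dia% (no leading wildcard) matches only text that STARTS with 'dia'.
dia = [r[0] for r in cur.execute(
    "SELECT txt FROM Notes WHERE txt LIKE 'dia%' ORDER BY txt")]
```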

Finally, the GROUP BY command tells the software to summarize the values in a column. The command "GROUP BY Claims.PatientID" restricts the entries in the PatientID field to one value per ID. So, if there are multiple values of PatientID, e.g., if the same patient had multiple visits, this restricts the analysis to one entry per patient. You would need to select the most recent visit, the average of the visits, or some other summary of the multiple visits of the same patient. The various commands that can be used to summarize a field include:

1. Group by, where all records having the same value are grouped into one.
2. Average, where all records in the group are averaged.


3. Standard deviation, where the standard deviation of all records in the same group is calculated.

4. Count, where the records in the group are counted.
5. Max and min, where the maximum or minimum value of the field in the group is selected. The maximum of a date field selects the most recent value.
6. Last or first, where the last or first value of the field in the group is selected.

If you summarize one field in your query, all fields must be summarized. If you need to select a subset of the data, you can use the WHERE command in a GROUP BY query; only records matching the criterion are included. The WHERE command is executed before the data are summarized. If you wish to apply a criterion after summarizing the data, you use the HAVING command instead.
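The interplay of GROUP BY, WHERE, and HAVING can be sketched on a hypothetical visits table (SQLite via Python's sqlite3): WHERE filters rows before grouping, the aggregates summarize each patient's visits, and HAVING filters the resulting groups.

```python
# A sketch of GROUP BY with WHERE and HAVING in SQLite via Python's sqlite3;
# the Visits table and its dates are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Visits (PatientID INTEGER, VisitDate TEXT)")
cur.executemany("INSERT INTO Visits VALUES (?, ?)",
                [(1, "2005-01-10"), (1, "2005-03-02"), (1, "2006-07-15"),
                 (2, "2005-02-01"), (3, "2006-05-05")])

# One summarized row per patient: COUNT and MAX collapse the multiple visits.
summary = cur.execute("""
    SELECT PatientID,
           COUNT(*),                        -- number of visits in the group
           MAX(VisitDate)                   -- most recent visit
    FROM Visits
    WHERE VisitDate >= '2005-01-01'         -- row filter, applied BEFORE grouping
    GROUP BY PatientID
    HAVING COUNT(*) >= 2                    -- group filter, applied AFTER grouping
    ORDER BY PatientID""").fetchall()
```

Only patient 1, with three visits, survives the HAVING clause.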

For ease of use, graphical interfaces are available for specifying SQL code; Microsoft Access, for example, has a helpful graphical interface for doing so.

Types of Files

The command USE AgeDx tells SQL Server to use the database named AgeDx. In addition, the query must identify the table that should be used. The command to reference a particular table starts with the reserved word FROM:

FROM dbo.data

This command says that the query is referencing the permanent table named data. One can also reference temporary tables such as:

FROM #data

The hash sign preceding the word data says that the query is referencing a temporary table called #data. A double hash sign indicates a temporary table that can be referenced from multiple queries, which is particularly helpful in passing temporary data to stored procedures.
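Temporary tables are not unique to SQL Server, though the syntax differs by dialect; in SQLite, for example, the TEMP keyword plays the role of the # prefix. A minimal sketch via Python's sqlite3:

```python
# A sketch of permanent vs temporary tables in SQLite via Python's sqlite3;
# SQLite's TEMP keyword plays the role of SQL Server's # prefix.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE data (id INTEGER)")           # permanent (like dbo.data)
cur.execute("CREATE TEMP TABLE scratch (id INTEGER)")   # temporary (like #data)
cur.execute("INSERT INTO scratch VALUES (1)")

# The temporary table is visible only to this connection and is dropped
# automatically when the connection closes.
n = cur.execute("SELECT COUNT(*) FROM scratch").fetchone()[0]
```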

Data Errors: Visits after Death

Before the data are merged from different files, it is important to exclude records of patients that are impossible. For example, sometimes a patient is reported to have a visit after death. Clearly, this is usually not possible. In rare situations some visits do occur after death; for instance, the transport of a dead patient from home to hospital is possible, and other post-mortality services exist, but these scenarios are rare and very specific. The most common reason for encounters with the health care system after death is that the date of death was entered incorrectly by a clinician or clerk. This is usually not the case if we are working with the Centers for Medicare and Medicaid Services (CMS) Death Master List, but in recent years CMS has decided against sharing this master list because of the potential for identity theft, so we may be left with erroneous dates of death. One of the first steps in cleaning the data is to identify patients whose date of death occurs before the date of various outpatient or inpatient encounters. The code for identifying such discrepancies may look as follows:

SELECT id


INTO #Exclude
FROM Dx
WHERE [Age at Death] < [Age at Dx]
GROUP BY id;

In this code, the "SELECT id" command tells the system that we are interested in finding the IDs of the patients. Note that we do not strictly need to identify the source table of the id field, as this query has only one table, so there is no confusion about where the variable comes from; still, it is always best to specify, so we should have used "SELECT Dx.id". The INTO command says that we should put these IDs into a table called #Exclude. The "FROM Dx" command says that we want to get this information from a table called Dx, which includes both age at death and age at diagnosis. The "WHERE [Age at Death]<[Age at Dx]" command selects records where age at death is less than age at diagnosis. The "GROUP BY id" says that we want to see only one value for each ID, no matter how many times the person's diagnosis occurs at a later age than the age at death. Also note that the WHERE command is executed before the GROUP BY command.
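The exclusion query can be exercised on a toy Dx table. SELECT ... INTO is Access/SQL Server syntax; the SQLite sketch below (via Python's sqlite3) uses CREATE TEMP TABLE ... AS SELECT, which plays the same role. The ages are hypothetical.

```python
# A sketch of the exclusion query in SQLite via Python's sqlite3; the Dx rows
# are hypothetical, and CREATE TEMP TABLE ... AS SELECT stands in for
# SELECT ... INTO #Exclude.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute('CREATE TABLE Dx (id INTEGER, "Age at Death" REAL, "Age at Dx" REAL)')
cur.executemany("INSERT INTO Dx VALUES (?, ?, ?)",
                [(1, 80.0, 79.5),    # plausible: diagnosed before death
                 (2, 70.0, 71.2),    # impossible: diagnosed after death
                 (2, 70.0, 72.0),    # same patient, a second bad record
                 (3, 65.0, 60.0)])

# GROUP BY id yields one row per offending patient, however many bad records.
cur.execute("""
    CREATE TEMP TABLE Exclude AS
    SELECT id
    FROM Dx
    WHERE "Age at Death" < "Age at Dx"
    GROUP BY id""")
excluded = [r[0] for r in cur.execute("SELECT id FROM Exclude ORDER BY id")]
```

Patient 2 appears once in the exclusion list even though two of that patient's records are inconsistent.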

Data Errors: Visits before Birth

If birth dates are wrong, then patients may show a visit prior to birth. In these situations, it is important to identify the person and exclude the entire record of that person.

SELECT id
INTO #Exclude
FROM Dx
WHERE [Age at Dx] <= 0
GROUP BY id;

SELECT d.*
FROM dbo.Data d LEFT JOIN #Exclude e ON d.id = e.id
WHERE e.id IS NULL;

This code says that we should select the ID of the person, one ID per person. We put the patients who have one or more diagnoses prior to birth into the temporary table #Exclude; we identify them as having a diagnosis at zero or negative age. The second query starts by selecting all variables in the permanent table Data, to which we give the alias d. It is left joined with the temporary table #Exclude, which has the alias e. The WHERE clause restricts the analysis of the permanent data table to patients whose ID does not appear in the left join with the temporary table #Exclude.
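The two-step exclusion pattern above, a left join filtered to null matches (sometimes called an anti-join), can be sketched end to end in SQLite via Python's sqlite3, with hypothetical patients:

```python
# A sketch of the anti-join exclusion pattern in SQLite via Python's sqlite3;
# the patient rows are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute('CREATE TABLE Data (id INTEGER, "Age at Dx" REAL)')
cur.executemany("INSERT INTO Data VALUES (?, ?)",
                [(1, 34.0), (2, -0.3), (3, 12.0)])   # patient 2: visit before birth

# Step 1: collect the ids to exclude, one row per patient.
cur.execute("""
    CREATE TEMP TABLE Exclude AS
    SELECT id FROM Data WHERE "Age at Dx" <= 0 GROUP BY id""")

# Step 2: keep only rows whose id finds NO match in the exclusion list.
kept = [r[0] for r in cur.execute("""
    SELECT d.id
    FROM Data d LEFT JOIN Exclude e ON d.id = e.id
    WHERE e.id IS NULL
    ORDER BY d.id""")]
```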

Data Errors: Patients with No Visits

In many studies, we are looking for encounters with the health care system for a patient. Sometimes the entries in a medical record are not for real people, as when a test case was entered. These test patients are typically identified with primary keys that start with ZZZ. Such records must be excluded before proceeding.


For some patients, there are no encounters with the health care system during the study period. This creates doubt about whether these patients are real or simply healthy. If the period is long, say a decade, then some encounters with the health care system are expected. In the absence of any encounter, it is important to explore why this may be occurring. For example, the patient may be using the facilities to pick up medications but not to receive medical services. Other explanations are also possible. It is important to count how many patients have no encounters with the system and find the most likely explanation for this finding.

Data Errors: Imputing Missing Values

Missing values are another common error. Keep in mind that the medical record is the report of an encounter, and patients may have encounters that are not reported, as when the patient meets the clinician at a social gathering. A question arises regarding what should be done with information not reported in the electronic health record. The answer depends on what the information is. For example, let us think through what should be done when the patient has no report of having diabetes. One possibility is that the patient was never diagnosed with diabetes, in which case we can assume that the patient does not have diabetes. In this case, we set the value of the field to 0:

IIF(Diabetes1 is null, 0, Diabetes1) AS Diabetes2

This command says that if the field Diabetes1 is null, replace it with 0; otherwise keep the value in Diabetes1. The new field is named Diabetes2.

One general strategy for imputing missing values is to assume that a missing value takes the most common value. Thus, if the diagnosis of diabetes is missing and most patients in our study do not have a diabetes diagnosis, then the best approach may be to assume that the patient does not have it. However, we cannot always assume that a missing diagnosis indicates a normal condition. In the emergency room, where there may not be time to diagnose the patient, a missing diagnosis may indicate insufficient time to establish it, and serious conditions may still have been present. In one emergency room study, for example, we found that missing values of the myocardial infarction (heart attack) diagnosis were highly correlated with mortality. In such situations, it is not right to assume that missing diagnoses indicate a normal condition.
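When a missing diagnosis can safely be treated as absent, the IIF expression above can also be written with COALESCE, which returns its first non-null argument and is supported in most dialects. A sketch in SQLite via Python's sqlite3, with hypothetical values:

```python
# A sketch of null replacement in SQLite via Python's sqlite3; COALESCE
# returns its first non-null argument, like the IIF expression above.
# The patient values are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Patients (id INTEGER, Diabetes1 INTEGER)")
cur.executemany("INSERT INTO Patients VALUES (?, ?)",
                [(1, 1), (2, None), (3, 0)])

# Null becomes 0; recorded values pass through unchanged.
imputed = cur.execute("""
    SELECT id, COALESCE(Diabetes1, 0) AS Diabetes2
    FROM Patients
    ORDER BY id""").fetchall()
```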

Treatment is usually reported when given, so it may be safe to assume that if a treatment is not reported, it was not given. Again, it may be important to understand whether patient conditions precluded giving the treatment.

Sometimes when a value is missing, the best approach is to estimate it from its nearest values. For example, if the blood pressure is missing in a particular quarter, it may make sense to estimate it from the next most recent time period. Other times we can impute the missing value from other available data. Thus, we can impute that the patient is diabetic if he or she is taking a diabetes medication or if hemoglobin A1c levels (a marker for diabetes) indicate diabetes. Exceptions occur, especially if medications are used to impute diagnoses. Some pre-diabetic patients take metformin, a drug also used for diabetes, so indicating that those patients had true diabetes would be erroneous. Physicians may prescribe a medication for a reason different from its typical use, so if we see the medicine in the database, we may think it indicates a diagnosis when it does not. In these situations, we need to check whether the rest of the patient's record supports the imputation.


Data Errors: Inconsistent Data

One way to have more confidence in the data in a medical record is to find out whether the data are consistent. A patient whose age shows as more than 110 years may have an erroneous date of birth. Here is a sample SQL code to detect whether age is above 110 years:

SELECT Id, [Age at Encounter]
FROM Encounter
WHERE [Age at Encounter] > 110;

A record showing a pregnant male is not reasonable. Here is code to detect whether male subjects are listed as pregnant:

SELECT Id
FROM Table
WHERE Gender = 'Male' AND Pregnant = 'Yes';

Of course, inconsistent data does not arise only with impossible combinations. Some level of inconsistency may arise among variables that often co-occur. Very tall people may be unlikely to be in our sample even if their height is possible. If we see a body weight of more than 400 lbs, we may wonder whether the patient's weight was taken while he/she was on a motorized chair. Even among probable events, inconsistent data should raise concerns about the quality of the data. When all fields point to the same conclusion, there is little concern with the quality of the data. When some fields suggest the absence of an event and other fields suggest that the event has occurred, a concern is raised, often requiring human chart reviews or a conversation with the patient. An example may demonstrate.

The Agency for Healthcare Research and Quality (AHRQ) has developed several measures of quality of care using electronic health records. One such measure is the frequency with which medication errors occur. When a medication error occurs, clinicians are required to indicate it in the record. However, sometimes this is not done. Sometimes, an activity is done but not recorded in the right place. Thus, AHRQ's patient safety indicator may rely on the frequency with which heart failure patients are discharged with beta blocker prescriptions (an evidence-based treatment). The doctor may have put the prescription in the note, but since the patient was going to their children's home in a different state, they may have filled the prescription at a pharmacy that was not monitored in our medication reconciliation files. In these situations we would undercount the number of beta blockers. This variation in reporting is one reason AHRQ recommends that expensive chart reviews be done to verify under- or over-reporting of patient safety issues. Some chart reviews, however, are not needed if other indicators are consistent with the reported event.

To see if other variables in the electronic health record are consistent with the reported event, we predict the event from other variables. Then, comparison of predicted and observed values indicates the extent to which data are consistent. For example, one would expect that patients who are older, who had long hospital stays, who have multiple medications, and who have cognitive impairments are more likely to fall. Suppose we have developed an index, or plan to use an index found in the literature, that predicts patients who are likely to fall. Then, we can compare the predicted probability of fall to actual observed fall. If there is negligible


probability of fall and the patient has fallen, then something is not right. If there is a high probability of fall and we see consequences of a fall (such as prolonged hospitalization), then perhaps the patient has fallen and the fall was simply not recorded in the EHR. Table 5 shows hypothetical results. Chart reviews may need to be done in the cases of low-probability events that were reported and high-probability events that were not reported.

Table 5: Reports Inconsistent with Probability of the Event

                        Probability of Fall
                        Low              Medium    High
Fall reported           Not consistent             Consistent
Fall not reported       Consistent                 Not consistent
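The decision rule in Table 5 can be sketched in a few lines of Python. This is a minimal illustration, assuming hypothetical predicted probabilities of falling and hypothetical low/high cutoffs; records in the off-diagonal cells are flagged for chart review:

```python
# Hypothetical records: predicted probability of fall and the reported status.
records = [
    {"id": 1, "p_fall": 0.02, "fall_reported": True},   # low risk but reported
    {"id": 2, "p_fall": 0.85, "fall_reported": False},  # high risk, not reported
    {"id": 3, "p_fall": 0.80, "fall_reported": True},   # consistent
]

def needs_chart_review(r, low=0.10, high=0.70):
    """A low-probability reported fall or a high-probability unreported fall
    is inconsistent with the rest of the record and warrants a chart review.
    The cutoffs are illustrative assumptions."""
    return (r["p_fall"] < low and r["fall_reported"]) or \
           (r["p_fall"] > high and not r["fall_reported"])

flagged = [r["id"] for r in records if needs_chart_review(r)]
print(flagged)   # [1, 2]
```

Only the two inconsistent records are flagged; the consistent record needs no review.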

A similar test of consistency can be applied to patient-reported outcomes such as pain levels. Obviously, pain is a subjective symptom. Some patients have more tolerance for pain than others. Patients may report the same level of pain, yet some decide to treat it with medications while others refuse medication. One can predict the expected pain level from the patient's medical history. Then expected and reported pain levels can be contrasted. When patient-reported pain levels do not fit the patient's medical history, additional steps can be taken to understand why that is the case. For example, the patient's medical history can be used to predict the potential for medication abuse. If the patient is at risk for medication abuse and is reporting inconsistent pain levels, then the clinician can be alerted to the problem so the clinician can explore the underlying reasons.

Table 6: Expected and Patient Reported Pain Levels

                                  Expected Pain Level
                                  Low              Medium    High
Patient reported      Low         Consistent                 Not consistent
levels                Medium
                      High        Not consistent             Consistent

Data Errors: Inconsistent Data Format

In many situations, data are read into the database from an external source with the wrong formatting. In a database, the type of each field must be set a priori. There are many different types, including integer, text, float, date, and real. CAST and CONVERT allow one to reassign the type of data. For example, if the field Age is read as text instead of a numerical value, then the following command will cast it as a numerical float (a number with a decimal):

CAST(Age AS Float) AS Age


The command says that the current field Age should be revised to be a numerical float. Thus ages such as 45.5 and 47.9, which were previously read as text, are now numbers with decimals. In converting a field from text to number, a problem arises with text entries such as the word "Null". Note that this is not recognized as a null value but as text with four letters. A simple CAST will not know what to do in these situations. Instead one should use code such as:

IIF(Age='Null', Null, CAST(Age AS Float)) AS Age

The command says that if (shown as IIF) the field Age contains the text 'Null', then the new field should treat this value as a true null value. Otherwise, it should cast the Age field as a float.
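A small runnable sketch of the same idea follows, using Python's built-in sqlite3 module. The table is hypothetical, and CASE WHEN is used as the portable equivalent of IIF (SQLite also uses REAL where other dialects say Float):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Age arrived from the external source as text, including the literal word 'Null'.
con.execute("CREATE TABLE Encounter (Id INTEGER, Age TEXT)")
con.executemany("INSERT INTO Encounter VALUES (?,?)",
                [(1, '45.5'), (2, 'Null'), (3, '47.9')])

# CASE WHEN is the portable equivalent of IIF(Age='Null', Null, CAST(Age AS Float)):
# the four-letter text 'Null' becomes a true null, everything else a number.
rows = con.execute("""
    SELECT Id,
           CASE WHEN Age = 'Null' THEN NULL
                ELSE CAST(Age AS REAL) END AS Age
    FROM Encounter
""").fetchall()
print(rows)   # [(1, 45.5), (2, None), (3, 47.9)]
```

Without the CASE guard, a bare CAST would silently turn the text 'Null' into a meaningless number rather than a missing value.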

Time Confusion: Landmark, Forward and Backward Looks

Data in electronic health records are time stamped. These dates and hours provide a chronology of events that have occurred in encounters with the healthcare system. Naturally, events for different patients occur at different chronological times. For the purpose of analysis, we need to start time at a landmark and count progression in time from this landmark. This is called landmark time to distinguish it from chronological time.

For example, suppose we want to evaluate the impact of a treatment on an outcome. The treatment occurs at a different chronological time for each patient. For some patients the treatment occurs in January; for others it occurs a month, a year, or a decade later. Once the treatment occurs, no matter when, the outcome must be evaluated within a certain follow-up period after treatment. During this follow-up period, the outcome of interest may or may not occur. To calculate the impact of a treatment on the outcome, we set the clock to zero at the date and time of the patient's treatment. The follow-up period is then calculated from this date. If the outcome occurs in the follow-up period, it counts. If it does not occur, or if it occurs after the follow-up period, it does not count. The point is that, in all calculations, the analysis uses time from the landmark event and not chronological time.
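The landmark logic above can be sketched as a short Python calculation. The patient records, dates, and 30-day window are hypothetical assumptions for illustration:

```python
from datetime import date

# Hypothetical events: treatment date (the landmark) and outcome date per patient.
patients = [
    {"id": 1, "treated": date(2016, 1, 10), "outcome": date(2016, 1, 25)},
    {"id": 2, "treated": date(2016, 6, 1),  "outcome": date(2016, 8, 15)},
    {"id": 3, "treated": date(2017, 3, 5),  "outcome": None},
]

FOLLOW_UP_DAYS = 30  # follow-up window measured from the landmark, not the calendar

def outcome_in_window(p):
    """Time starts at the treatment (landmark); the outcome counts only
    if it falls within the follow-up window after that landmark."""
    if p["outcome"] is None:
        return False
    days = (p["outcome"] - p["treated"]).days
    return 0 <= days <= FOLLOW_UP_DAYS

counted = [p["id"] for p in patients if outcome_in_window(p)]
print(counted)   # [1]
```

Patient 1's outcome at day 15 counts; patient 2's outcome at day 75 falls outside the window, and patient 3 had no outcome at all.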

In an electronic health record, where data are kept from birth to death, there are many different ways one can design a study. One approach is a retrospective case/control design, where patients with and without an intervention are contrasted. In this approach, one defines the cases and controls and looks backward to see if they were exposed to risk factors. Another approach is a cohort design, where a group of patients is followed over time to see what happens to them. In a cohort study, one selects a group of patients and looks forward in time to see if they have a specific outcome. Looking forward for data is typically done prospectively, but in an EHR, where one can see the same patient over time, one can look forward using existing data. Figure 2 is a visual depiction of a cohort design, where the cohort is organized at a particular time in the past and followed, still in the past, for a fixed time, to see if certain outcomes occur. In this situation, the outcomes are also observed in the past, but after the organization of the cohort. Since both exposure and examination of outcomes occur in the past, it is often difficult to distinguish these study designs in EHR data. The easiest way to tell whether we are faced with a case/control design or a cohort design is to ask whether we are looking backward or forward.


Figure 2: Historical Cohort and Case Control Studies

Example of Forward Look for Readmissions

An example may demonstrate the issues that arise in defining either the cohort or the case/control study design. Suppose we want to study 30-day readmission of heart failure patients in the hospice program. To begin, assume we want to do a cohort study: we define a group of heart failure patients in hospice and see how many of them have a readmission within 30 days. In this situation, the exposure is to hospice, and the outcome of interest is 30-day readmission. Hospice can affect readmissions in different ways. One scenario is that hospice use reduces active treatment and therefore patients are less likely to be readmitted, passing away at home or in hospice without hospitalization. Another scenario is that hospice use may increase treatment, in part because hospice programs may fail to manage patients' dyspnea (shortness of breath), and families, troubled by the breathing problems of their loved one, may choose to return to active treatment. What we know from the medical literature is that many heart failure patients die within a few days of hospice admission; and if the patient dies in a hospital, that counts as a hospital readmission within 30 days. To settle whether hospice use increases or decreases readmissions, one needs to look at the data. The data in the EHR can be organized to separate out these situations and determine whether hospice admission increases or decreases the probability of hospital readmission. If we want to analyze the data following the cohort study design, then the first step is to define the exposed and unexposed cohorts.

One has to define two groups of patients, each constituting a different cohort (see Figure 3). The first group is the exposed group. These are heart failure patients who use hospice care; we refer to these patients as discharged to hospice, even if the patient is admitted to hospice after a short stay at home. To sharpen the contrast between the exposed and unexposed groups, we exclude all patients whose hospitalization does not have a primary diagnosis of heart failure. We are only interested in seeing whether hospice use affects readmission of heart failure patients. So, if the patient does not have heart failure, he/she should not be part of the study.


Figure 3: Cohort Study of Impact of Hospice on Re-admissions

The definition of the cohort requires the presence of both the index hospitalization for heart failure and hospice use; this is because calculation of 30-day readmission requires an index hospitalization, and exposure requires use of hospice. Therefore, the most reasonable definition of the cohort is that both conditions must be present. The exposed cohort is compared to the non-exposed cohort, which must have an index hospitalization but no hospice use. Note that patients who do not have heart failure are excluded from both the exposed and unexposed groups. This is done to sharpen the contrast between the two groups.
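The "both conditions present" rule can be sketched as a SQL join, here run through Python's sqlite3 module. The table names, columns, and the 30-day co-occurrence window are hypothetical assumptions for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical tables: index admissions and hospice enrollment (ISO dates as text).
con.execute("CREATE TABLE Admission (PatientId INTEGER, Diagnosis TEXT, AdmitDate TEXT)")
con.execute("CREATE TABLE Hospice   (PatientId INTEGER, StartDate TEXT)")
con.executemany("INSERT INTO Admission VALUES (?,?,?)", [
    (1, 'Heart Failure', '2016-01-10'),
    (2, 'Heart Failure', '2016-02-01'),   # no hospice use: unexposed cohort
    (3, 'Pneumonia',     '2016-03-01'),   # excluded: not heart failure
])
con.executemany("INSERT INTO Hospice VALUES (?,?)", [
    (1, '2016-01-20'),   # co-occurs with the index hospitalization
    (3, '2016-03-05'),
])

# Exposed cohort: a heart failure index admission AND hospice starting
# within 30 days of that admission (the window is an assumption).
exposed = con.execute("""
    SELECT DISTINCT a.PatientId
    FROM Admission a
    JOIN Hospice h ON h.PatientId = a.PatientId
    WHERE a.Diagnosis = 'Heart Failure'
      AND julianday(h.StartDate) - julianday(a.AdmitDate) BETWEEN 0 AND 30
""").fetchall()
print(exposed)   # [(1,)]
```

Only patient 1 satisfies both conditions; patient 2 has the index hospitalization without hospice and patient 3 is excluded by diagnosis.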

Figure 4 provides some scenarios regarding the overlap between the index heart failure hospitalization and hospice use. Depending on how we define our cohort, some readmissions will be ignored and therefore study findings will change. Patient A should not be part of the exposed cohort because the index hospitalization occurs a long time before hospice use, so the two events do not co-occur. In contrast, Patient B has an index hospitalization near the start of hospice and therefore is considered part of the cohort. Patient C and Patient D are part of the cohort because index hospitalization and hospice use co-occur. Patient E is not part of the exposed cohort because the index hospitalization occurs after hospice use. Patients A, E, and F are part of the un-exposed cohort. Note that Patient A and Patient E are both part of the un-exposed group, even though at some point they had hospice use. For these two types of patients, hospice use and the index hospitalization do not co-occur, and therefore the conditions for the landmark event that defines the exposed cohort are not met.

Figure 4: Different Ways an Index Hospitalization and Hospice May Co-Occur (I = Index Hospitalization, R = Readmission, Yellow bar = Hospice use)


Example of Backward Look for Hospice

If we were to use a case/control study design, then the cases are defined as all patients who have a 30-day readmission. The controls are all patients who do not have a 30-day readmission. The frequency of hospice use is counted among the cases and controls. We start with the outcome (readmission or no readmission) and look back for hospice use. In some sense, the case/control study is easier to define, as we do not need to define the co-occurrence of index hospitalization and hospice use.
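The backward look amounts to splitting patients by outcome and then counting exposure in each group. A minimal Python sketch, with entirely hypothetical records:

```python
# Hypothetical records: 30-day readmission status (case vs control) and hospice use.
patients = [
    {"id": 1, "readmit_30d": True,  "hospice": True},
    {"id": 2, "readmit_30d": True,  "hospice": False},
    {"id": 3, "readmit_30d": False, "hospice": False},
    {"id": 4, "readmit_30d": False, "hospice": True},
    {"id": 5, "readmit_30d": False, "hospice": False},
]

def hospice_rate(group):
    """Fraction of a group with hospice exposure (the backward look)."""
    return sum(p["hospice"] for p in group) / len(group)

# Start from the outcome, then look back for the exposure.
cases    = [p for p in patients if p["readmit_30d"]]
controls = [p for p in patients if not p["readmit_30d"]]
print(hospice_rate(cases), round(hospice_rate(controls), 2))   # 0.5 0.33
```

Comparing the two rates is the core of the case/control contrast; a real analysis would of course adjust for covariates as discussed below.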

Figure 5: Case/Control Study Design

Confusion in Unit of Analysis and Timing of Covariates

In our discussion of the definition of the cohort or case/control studies, we have left two important points unaddressed. The first issue regards the unit of analysis. If the unit of analysis is the admission, then we might have multiple admissions for the same person, removing the assumed independence among the observations. The advantage of relying on admission as the unit of analysis is that clinicians will judge the need for hospice on each admission; we would be analyzing the data in the way that clinicians will use the findings from the study. The other alternative is to have one observation per person. This assures that the assumption of independence of observations is met. Each patient has one value, and the value is independent of other observations. For patients who have multiple hospitalizations or hospice uses, we have a choice. We could choose one of the observations randomly, or we could choose the first or last observation. These choices will also affect study findings. Consider, for example, the cohort design. If we choose the first time that index hospitalization and hospice use co-occur, then we have more of a chance to observe another hospitalization. If we choose the last one, by definition we would not see the re-hospitalization. If we choose randomly, there is a chance we may choose the last one, which would again distort the findings. To maximize the possibility of finding another re-hospitalization, the right choice is the first co-occurrence of index hospitalization and hospice use. A similar logic applies to the case/control study design. The probability of finding hospice use increases if we choose the last readmission. Therefore, in case/control studies it may be more important to look at the last 30-day readmission and not the


first. Again, choosing randomly will still distort the findings if we happen to randomly choose the first 30-day readmission.
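Reducing multiple events to one observation per person, keeping the first co-occurrence, can be sketched with a GROUP BY in SQL (run here through sqlite3; the table is hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical table of co-occurring index hospitalizations, one row per event.
con.execute("CREATE TABLE IndexEvent (PatientId INTEGER, AdmitDate TEXT)")
con.executemany("INSERT INTO IndexEvent VALUES (?,?)", [
    (1, '2016-05-01'), (1, '2016-09-12'),   # patient 1 has two qualifying events
    (2, '2016-02-20'),
])

# One observation per person: keep the FIRST co-occurrence, which leaves the
# longest window in which a later readmission could still be observed.
rows = con.execute("""
    SELECT PatientId, MIN(AdmitDate) AS FirstAdmit
    FROM IndexEvent
    GROUP BY PatientId
    ORDER BY PatientId
""").fetchall()
print(rows)   # [(1, '2016-05-01'), (2, '2016-02-20')]
```

Swapping MIN for MAX would instead keep the last event, which, as argued above, would hide any later re-hospitalization by construction.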

The second issue is the time period in which we observe covariates. Cases and controls, as well as exposed and un-exposed cohorts, differ in more ways than hospice use. It is important to identify these covariates and statistically control for them. For the exposed/unexposed cohorts, the covariate must occur before the landmark time of the cohort. For cases/controls, the covariates must also occur before the exposure to hospice (some exceptions apply, e.g., death during hospice). In other words, covariates should be measured in a timeframe when clinicians had access to these data. Only then can they choose a different course of action based on the medical history of the patient. In the EHR, information about patients is available even after the point of the clinical decision. Using this information could improve accuracy, but it is unwise, as it is not available at the point of decision making.
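Restricting covariates to the pre-landmark period is a simple timestamp filter. A minimal sketch, with a hypothetical landmark date and hypothetical time-stamped observations:

```python
from datetime import date

# Hypothetical landmark (e.g., hospice start) per patient.
landmark = {1: date(2016, 6, 1)}

# Covariate observations, time-stamped as in an EHR.
observations = [
    {"patient": 1, "name": "A1c", "value": 7.1, "when": date(2016, 5, 10)},
    {"patient": 1, "name": "A1c", "value": 8.3, "when": date(2016, 7, 2)},  # after landmark
]

# Keep only covariates observed BEFORE the landmark: these are the data a
# clinician actually had at the point of decision making.
usable = [o for o in observations
          if o["when"] < landmark[o["patient"]]]
print([o["value"] for o in usable])   # [7.1]
```

The post-landmark A1c is dropped even though it is in the EHR, because it was not available when the clinical decision was made.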

Summary

In this chapter, we emphasize the importance of reviewing data before analysis. If healthcare managers do analyses without proper preparation of the data, their conclusions may be erroneous. If they draw erroneous conclusions, they could inadvertently misdirect strategic plans for the use of resources and make costly mistakes. Careful attention to detail prior to data analysis is crucial in helping avoid such costly mistakes.
