
  • Chapter 2: Design and Validity

    Findley & Nguyen (2020), University of Illinois Urbana-Champaign Page 21

    Practice: Identify which design is being used in each of these studies

    Lunesta is a drug designed to treat insomnia. In a clinical trial of Lunesta, amounts of sleep each night are measured before participants begin taking Lunesta and again after participants have been treated with the drug. The researchers compare the average amount of sleep each participant was getting before and after starting Lunesta.

    A vaccine is being tested that would potentially prevent West Nile virus infection. A study is designed in which 4,800 volunteers in one particular city would agree to receive the vaccine. The incidence rate of West Nile Virus would be recorded for those 4,800 participants and compared to the incidence rate of the city at large.

A small experiment is done to consider various treatment options for patients with glaucoma. 48 patients with varying levels of severity are split across three groups. The patients are divided in such a way that each group has approximately proportional numbers of severe, moderate, and mild cases.

    To assess the effectiveness of a new energy supplement, researchers pay 300 people to volunteer to be in the study. 150 are randomly assigned to take the supplement while the other 150 receive a placebo supplement. The researchers then check back in 3 weeks to have them rate their energy levels.
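The random assignment in the last study can be sketched in a few lines of Python (a minimal illustration; the participant labels, seed, and group sizes are hypothetical, not from the study itself):

```python
import random

def randomly_assign(participants, seed=None):
    """Shuffle the participants and split them into two equal groups."""
    rng = random.Random(seed)
    pool = list(participants)
    rng.shuffle(pool)
    half = len(pool) // 2
    return pool[:half], pool[half:]  # (treatment, control)

# 300 hypothetical volunteers, as in the energy-supplement study
volunteers = [f"P{i:03d}" for i in range(300)]
treatment, control = randomly_assign(volunteers, seed=42)
print(len(treatment), len(control))  # 150 150
```

Because every volunteer is equally likely to land in either group, differences between volunteers tend to balance out across the two groups on average.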

    Now that we have discussed various experimental designs, let’s turn to thinking about how to rate the

    validity of different experiments. We start with defining and assessing Internal Validity

Internal Validity: Did the study function properly?

Internal validity is concerned with exactly how the study itself was conducted and how well it was designed to accurately identify a cause-and-effect relationship between variables.

    Can we confidently identify a causal relationship between two or more variables?

Key to understanding and assessing internal validity is understanding the concept of confounding.

Confounding variables are variables that have a causal effect on both our treatment variable AND our response variable. When discovering that two variables seem to correlate with each other (changes in one are associated with changes in the other), we should ask if it truly is a causal relationship, or if perhaps there is a confounding variable!

    o Confounding variables are especially prevalent in observational studies where we have done nothing to eliminate differences from confounders. But confounders can still pop up in experiments too when there are limitations to controlling the experimental setting.

(Handwritten answers: the Lunesta study is a matched pairs design (cross-over or pre-post design); the West Nile vaccine study is a non-randomized controlled experiment; the glaucoma study is a randomized block design; the energy supplement study is a randomized control trial.)


    An Example of a Confounding Variable

    Practice: An observational study shows that people who drink more coffee live longer! What might be a confounder to explain this association?

    Practice: An observational study found that people who use sunscreen have a higher rate of skin cancer. Does that mean that people should stop using sunscreen to lower their skin cancer risk? What do you think?

    But what might confounding look like in an experimental setting?

    Practice: A randomized control trial was conducted to assess if students’ mathematical performance was enhanced by taking MathBar, a new protein bar. Half of the participants were given MathBar before the test and were told this would boost their focus and memory recall. They completed their exam in one classroom. The other half did not receive anything and served as a control group. They completed their exam in another room. The group that received MathBar had a statistically significant higher average score. The researchers claimed: MathBar improves students’ mathematical performance.

    a) Can you think of any confounding issues? What might be a threat to the researchers’ claims?

    b) What details might you change about this experiment to eliminate these confounding threats?

[Handwritten diagram: "Ice Cream Sales!" and "Murder Rate Rises" move together, with "Heat Wave?" drawn as the hidden confounder driving both.]
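This classic confounding pattern, a heat wave driving both ice cream sales and crime, can be simulated: both outcomes depend only on a hidden "heat" variable, never on each other, yet they correlate strongly (all numbers below are invented for illustration):

```python
import random

rng = random.Random(0)

# Hidden confounder: a daily heat index (hypothetical units)
heat = [rng.uniform(0, 35) for _ in range(365)]

# Each outcome depends on heat plus noise -- not on the other outcome
ice_cream_sales = [10 * h + rng.gauss(0, 20) for h in heat]
crime_reports = [2 * h + rng.gauss(0, 5) for h in heat]

def corr(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

print(round(corr(ice_cream_sales, crime_reports), 2))  # strongly positive
```

Neither variable causes the other, yet an observational analysis that ignores heat would see a strong association between them.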

(Handwritten answers: coffee drinkers may work different jobs, physically vs. intellectually demanding, and make enough money to afford health care; people who spend a lot of time under the sun use sunscreen, and the sun exposure itself increases skin cancer risk.)

(Handwritten answers: a) encouragement before the exam and different classroom conditions; b) give the control group a regular protein bar plus the same encouragement, and randomly assign participants to classrooms.)


Confounding can interfere with the causal link we are trying to establish in many different ways, even in an experiment!

In addition to confounding, we should also consider other details of the study that create limitations. These include how we measured what we measured, whether the claim we make aligns with what we actually did, and other considerations. In this course, we address 6 broad families of Internal Validity Threats.

Internal Validity Threats

o 1) INSTRUMENTATION: Is our means of measurement appropriate, reliable, and rigorous?

    Choosing an instrument/test for measuring our response variable is a careful and important choice. It requires us to be very thoughtful and specific about what we want to measure and who our participants are. There are a lot of technical details and ways to assess instrumentation, but we focus on a few simple considerations:

    Appropriate: Is the measure being used in the right category? Is it the right level of difficulty for your participants?

At a basic level, this means ensuring that if your response variable is measuring strength, then your instrument shouldn't be measuring pure endurance. At a more detailed level, it means choosing the right measure of strength for your participants and aligning it with your question.

    Reliable: It’s important to ensure that all data collected is done so consistently. This is especially a threat when readings are based on observations, or if measurements are prone to human error.

    Are measurements being taken with an equally consistent time gap between intervention and measurement? Are there any calibration errors? Are the testing conditions for all participants the same? Do observations rely on memory or imprecise recordings (difficult to observe or faulty self-reporting)?

    Rigorous: The instrumentation being used should be comprehensive and complex enough for the question being asked. Judging the rigor of our instrument often requires expert validation, especially when measuring more cognitive factors like mood or intelligence.

    If using a questionnaire, are we asking a comprehensive set of questions? Are the questions worded appropriately and assessing what we think they are assessing? What aren’t we measuring in our study?

    Example: Asking participants on a nicotine patch how many cigarettes they smoked in the last week

Potential Biases:

Appropriate: Is # of cigarettes the right measure? How does this compare to asking about smoking cravings?
Reliable: Will participants correctly recall how many cigarettes they smoked?
Rigorous: Does this comprehensively measure what we set out to measure?


o 2) GROUP SELECTION: Are the participants/units balanced for each set of measurements?

For a multi-group design, we should always consider how the groups are composed. This would be a validity threat whenever we didn't randomly assign from a large group of participants. Blocking is a good strategy to build up validity, but may have its weaknesses if we can't consider every blocking characteristic that might matter. Non-randomized designs would carry a large group selection threat.
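Blocking can be sketched concretely. Recalling the glaucoma study, shuffling within each severity block and dealing patients round-robin into the groups guarantees each group a proportional share of every block (patient labels and block sizes here are hypothetical):

```python
import random

def block_randomize(patients_by_block, n_groups, seed=None):
    """Randomly assign patients to groups within each block, so every
    group receives a proportional share of each block."""
    rng = random.Random(seed)
    groups = [[] for _ in range(n_groups)]
    for block in patients_by_block.values():
        shuffled = list(block)
        rng.shuffle(shuffled)
        for i, patient in enumerate(shuffled):
            groups[i % n_groups].append(patient)  # deal round-robin
    return groups

# 48 hypothetical glaucoma patients in three severity blocks
blocks = {
    "severe": [f"S{i}" for i in range(12)],
    "moderate": [f"M{i}" for i in range(18)],
    "mild": [f"L{i}" for i in range(18)],
}
groups = block_randomize(blocks, n_groups=3, seed=1)
print([len(g) for g in groups])  # [16, 16, 16]
```

Within each block the assignment is still random, but the blocking step ensures severity can't end up lopsided across groups by chance.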

    This could also be true for a Matched Pairs Design if we have not matched our pairs appropriately.

    Regression to the Mean is a specific type of group selection threat if we have chosen our groups or selected participants into a pre-post study based on low/troubled results. If we had a pre-test and sorted all of the lower-half scores to the treatment group and the upper-half scores to the control group, the treatment group has more grounds for improvement. We have the same issue in a pre-post study if we only chose participants with the most to gain.
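Regression to the mean is easy to demonstrate by simulation: sort the lower pre-test half into the "treatment" group, and that group "improves" on the post-test even though no one receives any treatment at all (all scores below are synthetic):

```python
import random

rng = random.Random(7)

def score(ability):
    """One noisy test score for a participant of fixed ability."""
    return ability + rng.gauss(0, 10)

abilities = [rng.gauss(70, 5) for _ in range(1000)]
pre = [score(a) for a in abilities]

# Sort by pre-test: lower half -> "treatment", upper half -> "control"
order = sorted(range(len(pre)), key=lambda i: pre[i])
low, high = order[:500], order[500:]

# Post-test with NO real treatment effect for anyone
post = [score(a) for a in abilities]

gain_low = sum(post[i] - pre[i] for i in low) / len(low)
gain_high = sum(post[i] - pre[i] for i in high) / len(high)
print(round(gain_low, 1), round(gain_high, 1))  # low group gains, high group drops
```

The low scorers were partly low because of bad luck on the pre-test, so their second measurement drifts back up on its own; a treatment given to that group gets undeserved credit.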

    Example: Patients are asked if they want to try an experimental drug. Patients who say yes join the treatment group while those who say no are part of the control group.

Potential Bias:

Group Selection/Regression to the Mean: These are clearly non-balanced groups, and we might loosely think of the question posed to participants as a pre-test. Bottom line: people in the treatment group may be more driven to get better and may be characteristically and psychologically different.

o 3) SETTING EFFECTS: Is the setting and experience for all groups equivalent?

Placebo Effect: Are participants improving just because they know they are receiving something? This is a concern when we don't have an appropriate placebo for the control group, or have a pre-post study with no comparison group. The characteristics surrounding the treatment (a caring nurse, optimism) rather than the treatment itself may be confounders that contribute to the favorable results. Psychological effects may be contributing to their improvement!

    Environment Condition Differences: Are the environmental conditions different between the treatment and control conditions beyond the treatment factor you wish to study?

Reactance/Researcher Effects: If the participants know whether they are in the treatment group or control group, they react differently, much like a placebo effect. When possible, Blinding is advisable to minimize such effects. Additionally, if researchers interacting with the participants know who is in which group, they may act differently around each group (more encouraging or more engaged in the interaction, etc.). Double-Blinding would mean those administering the treatment/control conditions also don't know who is in which group.


    Example: Consider the MathBar example from earlier. What setting biases were present?

Potential Biases:

Placebo Effect: Participants receiving something, regardless of what it is, might be what is driving their performance.
Environment: Students did not complete the exam in the same setting/environment.
Reactance/Researcher Effects: Participants being told they are receiving something may psychologically affect their results.

o 4) TIME EFFECTS: Could time-related factors create possible confounding factors?

This factor can apply to any experiment in which each group's measurements don't get taken at the same time. Effects due to History would be when our study is affected by different times of day, different weather, different seasons, different sets of current events, and perhaps even different cultural norms if several years removed!

If the control group has their measurements done in the morning and the treatment group has their measurements done in the afternoon, that could affect the results (depending on what is being measured!). Same thing if it's sunny for the control group and rainy for the treatment group. Same thing if one group has their intervention during fall and the other during winter, etc.

Some studies use historical control groups for comparison to a currently-run treatment group; these have a large history threat.

Effects due to Maturation are also a time-effects threat, and would be specific to pre-post designs. In addition to effects from history, pre-post studies also run the threat of participants simply maturing/changing/growing before the final measurement.

    Key to recognizing time effects is a systematic bias that affects the control and treatment measures differently. If I had two cohorts at two different times, but equally comprised of treatment and control group participants, that won’t necessarily create a time-related threat on my study!

Example: Perhaps we want to consider the benefits of a yoga class for a group of people struggling with high blood pressure. We measure their blood pressure before the first class and measure their blood pressure again three months later after 12 classes.

    Potential Biases: History: The participants’ blood pressure may be affected by seasons, weather, or current events. Maturation: The participants may have also made changes to their diets/lifestyles.


o 5) MORTALITY: Did participants/units drop out of the study and create imbalance?

Attrition is a threat when participants drop out at different rates between our groups, or if, in a Pre-Post Design, we believe participants may be dropping out because of the treatment conditions:

Participants find the treatment conditions uncomfortable.
Participants simply stop adhering to the treatment.

When our attrition rates are different between groups, it's difficult to attribute differences in measurements to the treatment factor rather than it just being some type of filter that weeds out certain participants and produces unbalanced ending groups.

    Death would be another factor especially prevalent in more life-critical clinical trials. Perhaps we start with 100 people in each group, and 32 die in the treatment group and 15 die in the control group. But the people who survived in the treatment group had better test results than the survivors in the control group. Is that because the medication is better, or because our ending groups are no longer balanced?
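The survival scenario above can be simulated under one made-up assumption: the drug does nothing, but it is hardest on frail patients, so more of them die in the treatment group. The treatment survivors then look healthier purely because the ending groups are unbalanced (the mortality model and all numbers are invented):

```python
import random

rng = random.Random(3)

# 1,000 hypothetical patients per group; "health" is a latent score,
# and the treatment has NO real effect on it
treat = [rng.gauss(50, 15) for _ in range(1000)]
ctrl = [rng.gauss(50, 15) for _ in range(1000)]

def survives(health, in_treatment):
    """Frail (low-health) patients are more likely to die, and the
    treatment is hardest on the frail (an invented mortality model)."""
    if health < 40:
        risk = 0.9 if in_treatment else 0.4
    else:
        risk = 0.05
    return rng.random() > risk

treat_survivors = [h for h in treat if survives(h, True)]
ctrl_survivors = [h for h in ctrl if survives(h, False)]

avg = lambda xs: sum(xs) / len(xs)
# Treatment survivors average higher health than control survivors,
# even though the treatment did nothing: differential mortality
# filtered the frail out of the treatment group
print(round(avg(treat_survivors), 1), round(avg(ctrl_survivors), 1))
```

Comparing only the survivors makes the useless drug look beneficial; the honest comparison must account for who dropped out of each group and why.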

Example: Consider the arthritis patients from earlier; maybe there are people in the treatment group dropping out because (whether they admit it or not) they find the physical therapy painful.

Potential Bias: Attrition: Effects of the treatment resulting in unbalanced ending groups, confounding our findings.

o 6) TEST FAMILIARITY: Are participants simply getting better at completing the measure?

This is a threat any time participants are taking the same test/measure more than once. By design, a Pre-Post Design is frequently going to have this threat to some degree.

This threat would be most pertinent when the instrumentation is a mental or physical test; participants may just do better by being more used to it!

It could mildly interfere in other measures if the participant is just more comfortable with the measure, perhaps a blood pressure measurement or blood test.

    This is one of the biggest considerations to make when choosing a design. If the measure carries a high test familiarity threat, the researchers should consider avoiding a pre-post design!

    Example: Patients after surgery are being given treatment for mobility. They are asked to perform a set of stretches each day and their progress is monitored.

    Potential Bias: Participants may be growing more flexible from practice rather than treatment.


    Practice: An educational researcher is wondering whether her students learn better in a traditional classroom setup or a flipped classroom setup (Flipped classroom is when students watch videos as homework and come to class to work on problems). She does the traditional class in the fall semester and the flipped class in spring semester with the same course, but a new set of students. She wants to know if both sets of students have approximately equal academic success, so she gives both classes the same exam at the end of each semester.

    What are some internal validity threats we should consider with this study?

Additional thoughts on Internal Validity in Experiments

o Do I need Equal Group Sizes?

In later chapters, we'll discuss inferential claims and the statistical power of our conclusions. Basically, your confidence in identifying an effect is going to be limited by the size of your smallest group!
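The "smallest group" point follows from the standard error of a difference in two group means, SE = sqrt(s²/n1 + s²/n2): the term from the tiny group dominates no matter how large the other group grows (a sketch with made-up numbers):

```python
def se_diff(s, n1, n2):
    """Standard error of the difference between two group means,
    assuming both groups share the same standard deviation s."""
    return (s**2 / n1 + s**2 / n2) ** 0.5

s = 10  # hypothetical common standard deviation
print(round(se_diff(s, 50, 50), 2))      # 2.0  (balanced, 100 total)
print(round(se_diff(s, 10, 90), 2))      # 3.33 (same total, unbalanced)
print(round(se_diff(s, 10, 10_000), 2))  # 3.16 (huge control group barely helps)
```

Even ten thousand controls cannot shrink the uncertainty below what the ten-person group contributes on its own, which is why the smallest group caps your power.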

But in terms of the validity of the study itself, equal group sizes don't actually matter. What matters is that your groups have equivalent people/units, and that any difference you find between groups is due to the treatment factor alone.

o Do validity threats signal mistakes or biases from the researchers?

    It might be tempting to think that all experiments with validity threats are simply due to poor planning, researcher bias, or just flaws in general. But that’s not always the reason.

Natural Limitations and Ethical Considerations are often at play.

In many experimental settings, researchers have unavoidable validity threats. It may not always be possible to enact blinding based on the nature of the study. Instrumentation is often a judgment call for the researchers to make and is almost always going to require scrutiny and careful assessment by the research community. Timing and setting effects may simply be unavoidable based on resource limitations. Random assignment, or an experimental design in general, may just not be ethical (considering the potential effects of pregnant women's alcohol consumption on fetal development).

o Good science isn't about designing and reporting findings from perfect studies; it's about designing good studies and being careful to recognize potential holes and valuing replications.

(Handwritten answers:
① Instrumentation: Is the exam a good measurement of how well students learn?
② Group Selection: Is there a fundamental difference between Fall and Spring students?
③ Setting Effects: different semesters, different season/weather
④ Time Effects: exams taken at different times
⑤ Mortality: some students drop the class
⑥ Test Familiarity: Spring students (potentially) know the exam)