Trait Means and Desirabilities as Artifactual and Real Sources of Differential Stability of Personality Traits

Dustin Wood¹ and Jessica Wortman²

¹Wake Forest University ²Michigan State University

ABSTRACT Using data from 3 personality trait inventories and 7 samples, we show that trait items that have means near the scale midpoint and that vary more in their perceived desirability (e.g., items related to dominance, creativity, traditionalism, and organization) tend to be more stable over time, whereas items with means near the scale maximum or minimum and that vary less in their perceived desirability (e.g., items related to agreeableness, intellect, and reliability) tend to be less stable. Our findings indicate that items with means near the scale maximum or minimum have lower stabilities primarily due to having lower measurement dependability (i.e., short-term stabilities unlikely to reflect true change). However, items varying more in their desirability are more stable even after accounting for measurement dependability, consistent with the view that trait stability is facilitated in part by individuals actively working to develop in the direction they find desirable.

We would like to thank Grant Edmonds and Joshua Jackson for their helpful comments on an earlier draft of this article. We are particularly grateful to Michael Chmielewski and David Watson for providing the two-week dependability estimates of the Big Five Inventory (BFI) items, which afforded a much expanded consideration of the role of transient measurement error toward the research questions in this investigation. Note that the BFI dependability estimates used here are reported at the BFI scale level in Chmielewski and Watson (2009), and many of the Inventory of Individual Differences in the Lexicon dependability estimates are reported in Wood, Nye, and Saucier (2010).

Correspondence concerning this article should be addressed to Dustin Wood, Department of Psychology, Wake Forest University, 526 Greene Hall, Winston-Salem, NC 27109. Email: [email protected].

Journal of Personality 80:3, June 2012. © 2011 The Authors. Journal of Personality © 2011, Wiley Periodicals, Inc. DOI: 10.1111/j.1467-6494.2011.00740.x


Despite the recognition that some individual difference dimensions are more stable than others, this understanding has largely been limited to the recognition that the rank ordering of individual differences along dimensions such as weight, height, and intelligence is more stable than the rank ordering of individual differences along dimensions such as personality traits or mood (e.g., Conley, 1984; Fujita & Diener, 2005). There is little understanding as to why some personality trait dimensions are more stable than others. Indeed, some have downplayed the importance of addressing this question by suggesting that Big Five trait dimensions are about equally stable, and that consequently there may be no important differences in stability of personality traits to explain (Caspi, Roberts, & Shiner, 2005; Roberts & DelVecchio, 2000). Despite this suggestion, there have been recent calls for more research on the topic, as evidence is converging that the stabilities of different personality traits do in fact differ across domains (Watson, 2004). For instance, one of the more regular recent findings has been that individual differences in extraversion might be more stable than individual differences in neuroticism (Hampson & Goldberg, 2006; Vaidya, Gray, Haig, Mroczek, & Watson, 2008; Vaidya, Gray, Haig, & Watson, 2002).

In the current study, we attempt to identify trait properties that can predict why certain dimensions are more stable over time than others, similar to studies that have identified trait properties such as observability and low evaluativeness as leading to increased agreement in person perception studies (e.g., Funder, 1995; Funder & Dobroth, 1987; John & Robins, 1993). Research on differential stability across traits has typically focused more on differences that might arise from how participants are asked to rate the same or similar content (e.g., by asking people, “how much does this item describe how you see yourself?” vs. “how much does this item describe how you feel?”; Watson, 2004). The few investigations that have explored how trait content may be associated with trait stability have indicated that measures of trait dimensions associated with mood- or affect-related content (e.g., happiness, anxiety) tend to be somewhat less stable than measures of more behavioral dimensions (e.g., sociability, organization; Vaidya et al., 2002; Watson, 2004).

In the current investigation, we hope to contribute to the understanding of differential stability by demonstrating more compellingly that trait differences in stability exist, and that these differences can be understood as arising both from differences in transient error and from differences in how much traits vary in their desirability. Concerning the latter, we are particularly interested in the broader idea that individuals have a role in directing their own trait levels (Roberts, Wood, & Caspi, 2008), and that therefore the differential stability of personality traits can be understood as arising from the efforts of individuals to develop their traits in the direction they personally find desirable. We explore these ideas by showing how trait stability coefficients may be associated with three very basic properties of traits and the way they are measured. In particular, we focus on how a trait item’s measured stability is associated with (1) the item’s variability of endorsement (SDX), (2) the item’s mean level of endorsement (MX), and (3) the item’s variability in rated desirability (SDD). A diagram summarizing our expected associations is provided in Figure 1.

Property 1: A Trait Item’s Standard Deviation (SDX)

[Figure 1. Expected associations between means and standard deviations of trait endorsement, and means and standard deviations of a trait’s rated desirability. H = hypothesis (described in text). Diagram elements: Endorsement Mean (MX), Endorsement Variability (SDX), Desirability Mean (MD), Desirability Variability (SDD), and rank-order stability, linked by paths labeled H1, H2, and H3.]

The first property of a trait item that we expect to be related to its observed stability is how much endorsement of the item varies across people in the population, as indexed by the standard deviation. Researchers have given little attention to how a trait item’s standard deviation is linked to its level of stability. However, it would be very surprising to find that the measured variability of different traits is unassociated with stability, for the simple reason that measures that have limited variability or restricted range are not expected to correlate highly with any other variables (Cohen, Cohen, West, & Aiken, 2003). This leads to our first hypothesis:

Hypothesis 1: Trait items with greater variability (SDX) have higher levels of stability (s).

As detailed in Figure 1, we expect there are at least two major reasons that traits show differences in their variability. First, differences in the estimated variability of traits are likely produced through a potentially artifactual path having to do with restricted range. Second, differences in the estimated variability of traits are also likely produced through variation in the trait’s desirability. We elaborate on both routes below.

Property 2: A Trait Item’s Mean (MX)

A trait item’s mean level of endorsement may be a source of artifactual differences in stability estimates across traits. Given the way in which personality traits are generally measured (e.g., with 5-point scales ranging from strongly disagree to strongly agree), a number of items are endorsed at the highest or lowest level possible by a large percentage of participants, leading to strong ceiling or floor effects. As a result of those effects, traits with means near the end points of the scale should have less variability, as shown by smaller standard deviations, which in turn should reduce the extent to which the trait as measured can correlate with other variables, as suggested earlier.

Hypothesis 2: Trait items with means (MX) that deviate highly from the scale midpoint will have lower levels of variability (SDX) and stability (s).

Although cross-trait analyses linking a trait’s mean level of endorsement to its variability of endorsement have not been conducted to our knowledge, substantial relationships between item means and item variability estimates have been reported in other investigations (Baird, Le, & Lucas, 2006; Eid & Diener, 1999; Paunonen & Jackson, 1985), in particular showing that items with very high or low means have lower variability. This relationship indicates the existence of ceiling and floor effects, where meaningful variability is not separated at extreme trait levels. For instance, we could potentially separate participants who see themselves as “highly honest” from participants who see themselves as “exceptionally honest” if we were to provide participants with a scale with more extreme response options. Therefore, we consider that item means may primarily link to the item’s stability through a restriction of range artifact associated with ceiling and floor effects.
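The restriction of range argument can be illustrated with a small simulation. The sketch below is not the authors' analysis: the latent means, the error standard deviation of 0.7, the sample size, and the helper names (`respond`, `simulate_item`) are all hypothetical choices made for illustration. Each simulated person has a fixed latent trait level, answers the same 5-point item twice with independent transient error, and has each response rounded and clipped to the 1 to 5 range. Because latent levels do not change between occasions, any drop in the retest correlation for the item with a mean near the ceiling reflects the measurement artifact alone.

```python
import math
import random


def sd(values):
    """Sample standard deviation."""
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def respond(level, rng):
    """One observed response: latent level plus transient error,
    rounded and clipped to a 1-5 response scale (hypothetical model)."""
    noisy = level + rng.gauss(0.0, 0.7)
    return min(5, max(1, round(noisy)))


def simulate_item(latent_mean, n=5000, seed=0):
    """Return (SD of Time 1 responses, Time 1-Time 2 retest correlation)
    for an item whose latent levels are unchanged between occasions."""
    rng = random.Random(seed)
    levels = [rng.gauss(latent_mean, 1.0) for _ in range(n)]
    time1 = [respond(v, rng) for v in levels]
    time2 = [respond(v, rng) for v in levels]
    return sd(time1), pearson(time1, time2)


sd_mid, r_mid = simulate_item(latent_mean=3.0)   # item centered on the midpoint
sd_high, r_high = simulate_item(latent_mean=5.5)  # item pushed against the ceiling
print(f"midpoint item: SD = {sd_mid:.2f}, retest r = {r_mid:.2f}")
print(f"ceiling item:  SD = {sd_high:.2f}, retest r = {r_high:.2f}")
```

Under these assumptions the ceiling item shows both a smaller standard deviation and a lower retest correlation than the midpoint item, even though the two items are equally stable at the latent level.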

Property 3: A Trait’s Variability in Desirability (SDD)

Our third explanation of why some traits differ from one another in measured stability concerns the magnitude of variation across individuals in how desirable the trait is perceived to be. It is well known that desirable traits are much more likely to be endorsed in self-ratings than undesirable traits (e.g., Edwards, 1953). However, it has not been recognized to our knowledge that the standard deviations of trait self-ratings and of trait desirability ratings are also likely to be highly associated. This leads to the following prediction:

Hypothesis 3: Trait items with greater between-person variability in desirability (SDD) should have both greater levels of variability (SDX) and stability (s).

As we described earlier, we expect trait means to relate to trait stabilities mainly through the operation of a measurement artifact. In contrast, finding that traits varying more in their perceived desirability also show higher variability and stability would support the operation of a more theoretically interesting source of differences in trait stability estimates. In particular, we expect that individuals possess traits at the levels they do in part because they have taken efforts to possess characteristics they find desirable or associated with desirable outcomes (Baltes, 1997; Roberts et al., 2008; Wood, Gosling, & Potter, 2007). We expect that the well-documented finding that mean trait endorsements are highly associated with mean trait desirabilities is largely driven by the fact that when traits are seen as widely desirable, most people are motivated to possess higher levels of the trait, resulting in both higher trait means and a smaller range of variation in trait levels. For instance, the desirability of agreeableness and conscientiousness seems to increase somewhat with age, and this may be one of the reasons these traits increase with age (Roberts, Wood, & Smith, 2005; Wood et al., 2007). However, when individuals vary considerably in the extent to which they perceive the trait to be desirable, this should push for greater variation in endorsement of the trait, as different individuals are motivated toward different poles of the trait. As a result, traits for which there is greater variability in desirability (high SDD; e.g., traditional) should have greater stability because people are motivated to progress toward different poles of the trait. In contrast, for traits that are consensually desired or undesired (low SDD; e.g., nice, anxious), stability should be decreased because nearly all individuals are attempting to develop toward the same pole on the trait dimension. This may result in both lower variability and lowered stability, as differential success in development of the trait should cause the rank ordering of people highest and lowest in the trait to change more frequently.

Sources of Trait Differences in Measured Dependability Versus Systematic Stability

We have thus far discussed potential sources of trait differences in the standard stability coefficients explored by psychologists, which are formed simply by correlating participant scores on a measure collected at two time periods (what we will refer to more precisely as simple stability estimates). However, as noted by numerous investigators, our conclusions about the nature of trait differences in stability could differ substantially as a function of whether we have accounted for measurement error (e.g., Chmielewski & Watson, 2009; Costa, McCrae, & Arenberg, 1980; Ferguson, 2010). We have speculated that trait items with very high or low means will be estimated as less stable through a restriction of range artifact, whereas traits varying more in their perceived desirability will truly be more stable due to individuals working to develop their traits in the direction they see as desirable. We continue by elaborating on how the validity of these hypotheses can be clarified by considering a measure’s dependability and systematic stability coefficients, in addition to the standard simple stability estimates usually considered in stability studies.


The dependability of a measure (also frequently referred to as test-retest reliability; e.g., McCrae, Kurtz, Yamagata, & Terracciano, 2011) is computed in the same way as a simple stability estimate, by correlating participant scores on a measure collected some time period apart, but it differs in that the two scores are collected over a test-retest interval that is so short that true changes in the rank ordering of placement in the dimension being measured should be negligible (e.g., a 2-week interval; Cattell, Eber, & Tatsuoka, 1970; Watson, 2004). There is likely substantial variation in levels of dependability across personality trait items. To illustrate, Wood, Nye, and Saucier (2010) reported that self-descriptions of how much one is generally outgoing/sociable, disorganized/messy, and creative/imaginative showed considerably higher levels of dependability over 3 days (rs ≈ .77) than self-descriptions of how much one is generally pleasant/agreeable, courteous/polite, and truthful/honest (rs � .54). Similarly, Chmielewski and Watson (2009) reported that the Big Five Inventory (BFI) extraversion scale tended to show higher retest consistency over a 2-week interval than the BFI agreeableness scale. This is despite the fact that the rank ordering of how individuals order themselves along the dimensions of “general outgoingness” and “general pleasantness” should be expected to change negligibly over such a short period.

As noted by Chmielewski and Watson (2009), the dependability of a measure will decrease as scores on the measure at a given time point increasingly reflect inconsistent influences on a person’s reported score even over the span of a couple of days (what they refer to as transient error), which consequently increase the rate of inconsistent responding to the measure by the same participant over negligible time differences. Such inconsistencies should arise in large part from variations in the particular trait-relevant information brought to mind when making trait inferences, which may be impacted by recent experiences or current moods (Robinson & Clore, 2002; Schwarz & Clore, 1983).

In contrast, systematic stability (referred to as “true stability” and “corrected stability” in Chmielewski & Watson, 2009) also refers to the stability in the rank ordering of individuals’ scores on a measure that would be observed over two time points, but it differs from simple stability coefficients in that the participant scores being correlated reflect no transient error. Systematic stability can be thought of as referring to a test-retest correlation relating how participants systematically respond to an item at one time point (i.e., how they respond on average) with how they systematically respond to the same item at a later time point. A purely empirical estimate of a measure’s systematic stability can be formed by correlating participant average scores after each participant has completed a large number of ratings at one time point with the average of a large number of different ratings completed at some later time period, as has been done in some experience sampling investigations (e.g., Baird et al., 2006; Fleeson, 2007; Noftle & Fleeson, 2010). However, given that it is often impractical or just strange to collect such a large number of ratings on some scales within the time period, especially with general self-ratings (e.g., asking people, “how talkative do you see yourself in general?” 30 times in a week), it is also possible to estimate a measure’s systematic stability by dividing the measure’s simple stability estimate (e.g., a test-retest correlation over a year) by the measure’s dependability estimate (e.g., the test-retest correlation over a week; Chmielewski & Watson, 2009). This is shown through the path diagram depicted in Figure 2: the simple stability of a scale (s) is mathematically expected to reflect the systematic stability of a scale (S*) multiplied by the square root of the product of the scale’s dependabilities (d) at Time 1 and at Time 2:

$$ s = S^{*} \sqrt{d_1 d_2} \qquad \text{(Eq. 1)} $$

If the dependabilities at Time 1 and 2 are allowed or assumed to be equal (d = d1 = d2), then solving for the unobserved systematic stability in Equation 1 can be done using the simple stabilities and dependabilities observed in the data:

$$ S^{*} = s / d \qquad \text{(Eq. 2)} $$

[Figure 2. Modeling the systematic stability (S*) of individual differences in trait levels from observed dependability estimates (d) and observed stability estimates (s). Path diagram: the Time 1 trait level is measured by X1 and X1', and the Time 2 trait level by X2 and X2', with $d_1 = r_{11'}$, $d_2 = r_{22'}$, $s = r_{12}$, and $S^{*} = s / \sqrt{d_1 d_2}$.]
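As a worked illustration of Equations 1 and 2, the following sketch computes a systematic stability estimate from a simple stability and one or two dependability estimates. The function name `systematic_stability` and the numeric inputs are hypothetical and are not values reported in the article.

```python
import math


def systematic_stability(s, d1, d2=None):
    """Estimate systematic stability S* = s / sqrt(d1 * d2).

    s  : simple stability (long-interval test-retest correlation)
    d1 : dependability at Time 1 (short-interval retest correlation)
    d2 : dependability at Time 2; if omitted, d2 is assumed equal to d1,
         which reduces the formula to S* = s / d (Eq. 2)
    """
    if d2 is None:
        d2 = d1
    if not (0.0 < d1 <= 1.0 and 0.0 < d2 <= 1.0):
        raise ValueError("dependabilities must lie in (0, 1]")
    return s / math.sqrt(d1 * d2)


# Hypothetical item: one-year stability of .60, dependability of .80
example = systematic_stability(0.60, 0.80)
print(round(example, 2))  # 0.75
```

Because s and d are themselves estimated with sampling error, the ratio can exceed 1.0 for an individual item in a finite sample; the sketch does not truncate such values.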

Understanding simple stability estimates to be a function of both the level of systematic stability of the ordering of individuals and the level of dependability of the measure allows us to better illuminate our earlier hypotheses. As suggested earlier, trait means are expected to relate to stability estimates by indicating the operation of a restriction of range artifact that increases the impact of inconsistent responding, rather than indicating which traits are truly more stable than others. This is an interpretation that leads to clear predictions for how trait means should relate to different types of stability estimates:

Hypothesis 4a: Trait item means (MX) will be highly associated with item dependabilities (d) but should be negligibly associated with item systematic stabilities (S*).

In contrast, traits varying more in their perceived desirability are expected to be more stable due to individuals actively working to develop their trait levels in the direction they perceive as desirable. Since trait desirabilities should tend to be correlated with initial trait levels, this should serve to preserve the rank ordering of how individuals systematically see themselves over time due to active maintenance efforts. The relationship between trait SDDs and simple stability estimates is expected to reflect real differences in which traits are more stable, rather than which traits are impacted more by transient measurement error, and this interpretation leads to different predictions in how they should relate to the different types of stability estimates:


Hypothesis 4b: Trait items with higher variability in desirability (SDD) should be negligibly associated with item dependabilities (d) but should show greater systematic stability (S*).

As shown in Figure 1, we expect trait item variabilities (SDX) to be independently influenced by both item means (MX) and item variabilities in desirability (SDD), in addition to certain other trait properties. Consequently, we expect that trait item variabilities will also be significantly associated with the item’s systematic stability estimates, although we venture no hypotheses about whether these associations will be equal in magnitude to the associations found with item variabilities in desirability.

The Current Study

We test the hypotheses outlined above in analyses using three different item pools: the Big Five Inventory (BFI; John & Srivastava, 1999), a set of Goldberg adjective markers of the Big Five that have been used in previous research (GBFM; Walton & Roberts, 2004; Wood, Harms, & Vazire, 2010), and an early version of the Inventory of Individual Differences in the Lexicon (IIDL; Wood, Nye, et al., 2010).

Studies of differences across traits tend to be done at one of two levels: at the level of broad scales/latent factors (e.g., the Big Five; Roberts & DelVecchio, 2000; Vaidya et al., 2008), or at the level of single items considered without aggregation (Funder & Dobroth, 1987; John & Robins, 1993). We took the latter approach in the present investigation due to considerable evidence that there is substantial variability in the properties and correlates of traits that are considered to be contained within a single Big Five domain (e.g., Jackson et al., 2009; Wood, Nye, et al., 2010). For instance, sociability, positive affectivity, assertiveness, and sensation seeking clearly are expected to have differing properties, despite being considered “facets” of extraversion. Consequently, aggregating these separate traits into an overall extraversion measure will tend to gloss over reliable differences across traits (Wood & Hensler, 2011; Wood, Nye, et al., 2010). We will ultimately demonstrate through these single-item analyses that there are clear indications of differences in rank-order stabilities of traits, even among traits considered to be within the same Big Five domain, and these differences are consistent with the hypothesis that traits varying more in desirability are more stable.


METHOD

Data were collected from a variety of samples to obtain estimates of each inventory’s item properties. As the analyses proceeded at the item level of analysis, we organize this section by the three inventories used and then describe the samples used to estimate different trait properties.

Inventory 1: Goldberg Big Five Markers

University of Illinois Greek sample (UIUC-G). A total of 366 participants were recruited from seven different fraternities and sororities at the University of Illinois. Of these participants, 203 (55%) were women recruited from three sororities, and the remaining 163 men were recruited from four fraternities. Participants had a mean age of 19.5 at the first assessment. The number of participants per organization ranged from 34 members in one fraternity to 75 in two sororities. Participants completed the survey at the organization’s house, usually in the span of about 2 hours. Participants were given $10 for completing the 2-hour survey, and the organizations were also given money for their assistance with the study.

Participants rated their level of endorsement of 53 items of a larger set of 100 Big Five markers (BFM) originally reported by Goldberg (1992) and used in previous research (Walton & Roberts, 2004); these items consisted of 10 markers for all Big Five trait dimensions, except for neuroticism, which had 13 markers. Items were rated using a 5-point response scale ranging from 1 (Strongly disagree) to 5 (Strongly agree), with items presented in alphabetical order. Note that in this and in the remaining samples, trait means and standard deviations were subsequently rescaled from the original metric (e.g., 1–5 scales) to a percentage of maximum possible (POMP, range 0–100; Cohen, Cohen, Aiken, & West, 1999) in order to place all the inventories on the same metric. To obtain simple stability estimates, a total of 180 of the 366 participants who completed the original survey completed the same self-report survey a year later.
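The POMP rescaling mentioned above is a linear transformation of a score from its original response range onto a 0 to 100 range. A minimal sketch follows; the function names `to_pomp` and `sd_to_pomp` and the example values are illustrative and do not come from the article. Means are shifted and stretched, whereas standard deviations are only stretched, because adding a constant does not change a standard deviation.

```python
def to_pomp(score, scale_min, scale_max):
    """Rescale a score or item mean to percentage of maximum possible (0-100)."""
    return 100.0 * (score - scale_min) / (scale_max - scale_min)


def sd_to_pomp(sd, scale_min, scale_max):
    """Rescale a standard deviation to the POMP metric (stretch only, no shift)."""
    return 100.0 * sd / (scale_max - scale_min)


# A hypothetical item mean of 4.2 and SD of 0.8 on a 1-5 scale
pomp_mean = to_pomp(4.2, 1, 5)
pomp_sd = sd_to_pomp(0.8, 1, 5)
print(round(pomp_mean, 1), round(pomp_sd, 1))  # 80.0 20.0

# The midpoints of a 1-5 scale and a 1-7 scale land on the same POMP value
print(to_pomp(3, 1, 5), to_pomp(4, 1, 7))  # 50.0 50.0
```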

Because previous analyses with this inventory found words with an “un-” prefix (unenvious, unintelligent, unreliable) to show an acquiescence rating bias (Wood, Harms, et al., 2010), these adjectives were removed from the analysis. Additionally, the item “imperturbable” was removed because participants frequently reported not knowing what the word meant, and consistent with this, the item showed an extremely high number of neutral responses, resulting in an artificially low estimate of this item’s variability. These omissions resulted in the analyses using the GBFM being based on a total of 40 items.


GBFM means and standard deviations were available from the two time points. These estimates were averaged together, resulting in highly reliable estimates of the ordering of item means (α > .99) and standard deviations (α = .97).
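One way to read these reliability values is as coefficient alpha computed with trait items as the cases and assessment occasions as the components. The article does not spell out the computation, so the sketch below is an assumption about the procedure, and the item means in it are hypothetical values on the POMP metric.

```python
from statistics import variance


def coefficient_alpha(columns):
    """Coefficient alpha for k parallel columns of estimates.

    columns: list of k equal-length lists; here each list holds the
    item means (one entry per trait item) from one assessment occasion.
    """
    k = len(columns)
    if k < 2:
        raise ValueError("need at least two occasions")
    n_items = len(columns[0])
    totals = [sum(col[i] for col in columns) for i in range(n_items)]
    sum_of_variances = sum(variance(col) for col in columns)
    return (k / (k - 1)) * (1 - sum_of_variances / variance(totals))


# Hypothetical item means for five items at two time points
time1_means = [82.0, 55.5, 40.0, 71.0, 63.5]
time2_means = [80.5, 57.0, 42.5, 69.0, 65.0]
alpha_means = coefficient_alpha([time1_means, time2_means])
print(round(alpha_means, 3))
```

A value near 1 indicates that the two occasions order the items almost identically, which is what justifies averaging the occasions into a single estimate per item.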

WFU Subject Pool-1 (WFUSP-1) sample. A total of 249 students at Wake Forest University completed a mass-testing survey in order to receive credit for their introductory psychology course. Participants completed the GBFM items in a modified form in which they were asked to rate how desirable each item was, using the question “How desirable is it to be someone who generally . . . ,” with response options ranging from 1 (Very undesirable) to 5 (Very desirable).

Inventory 2: Big Five Inventory

UIUC dorm (UIUC-D) sample. All incoming freshmen at the University of Illinois at Urbana-Champaign were invited to complete the survey near the beginning of the school year (participants were emailed the survey a week before arriving on campus). Participants were asked to rate themselves on the 44-item BFI (John & Srivastava, 1999) using the standard 1 (Strongly disagree) to 5 (Strongly agree) response scale. A total of 1,850 people completed this initial survey. A total of 983 individuals whose roommates had also completed the survey were recontacted to complete follow-up surveys; these participants had a mean age of 18.1 at the first assessment, and 59% were female. The first follow-up survey was 3 months after the first survey; 550 participants completed this follow-up survey, resulting in 3-month item stability estimates based on between 540 and 550 observations. The second follow-up survey was approximately 8 months after completing the first survey (near the end of the second semester); 464 participants completed the survey, resulting in 8-month item stability estimates based on between 457 and 464 observations. Across the 44 BFI items, the 3-month and 8-month stabilities correlated .91, and consequently these two values were averaged together to obtain simple stability estimates of the BFI items.

UI student (UI-S) sample. Dependability estimates of BFI items were obtained from data collected by Chmielewski and Watson (2009). A total of 556 students in psychology courses at the University of Iowa completed the BFI twice 2 weeks apart, for course credit or monetary compensation. Additional information about the data collection is reported in Chmielewski and Watson (2009, p. 189).

WFU Subject Pool-2 (WFUSP-2) sample. A total of 324 students at Wake Forest University completed a mass-testing survey in order to receive credit for their introductory psychology course. Participants rated how desirable they thought each item was, using the question “How desirable is it to be someone who generally . . . ,” with response options ranging from 1 (Very undesirable) to 5 (Very desirable).

Inventory 3: Inventory of Individual Differences in the Lexicon, Early Version

The development of the IIDL instrument is reported by Wood, Nye, et al. (2010). The IIDL was developed to be a “comprehensive” instrument measuring any content found to be represented by several terms in the lexicon rather than focusing on measuring the core vectors of the Big Five dimensions, as with the GBFM and BFI instruments, and consequently it surveys a broader range of content than that found in those inventories. An early 58-item version of the IIDL and 13 additional items from the final version of the IIDL were used here, resulting in a total of 71 items. The earlier version resulted from a preliminary analysis using a slightly different item pool than the one used in the final analysis. Some items assessing physical dimensions (e.g., thin/slender, short/little) were not assessed in the samples used here.

WFU dorm (WFU-D) sample. Participants were 381 individuals from the freshman dormitories at Wake Forest University; these participants had a mean age of 18.5 at the first assessment, and 60% were female. Individuals were offered $15 for completing the survey. At the first wave, participants completed the early 58-item version of the IIDL. The longitudinal assessments consisted of the same instrument plus 13 additional items from the final version of the IIDL covering content not reflected by the initial 58 items. A total of 189 individuals were invited to complete subsequent waves of the study. A total of 140 of these 189 individuals completed a follow-up assessment 4 months after the first assessment, and a total of 147 of these 189 individuals completed a follow-up assessment a year after the first assessment. Finally, in all three waves, participants completed the IIDL twice a few days apart in order to estimate the retest dependability (average retest interval was 5.2 days at Wave 1, 4.4 days at Wave 2, and 6.1 days at Wave 3; see Wood, Nye, et al., 2010, for more information). The item dependability estimates were highly correlated across the 58 items administered in all three waves (avg. r = .65), and consequently the three estimates for each item were averaged together. As the IIDL was completed twice at each assessment, there were eight estimates available of the test-retest stability over about half a year (all four combinations of the lagged IIDL measurements at the first and second waves, and all four between the second and third waves); these were averaged together to estimate the half-year stability of the IIDL items. Additionally, there were four estimates of the stability over a year (all four combinations of the lagged IIDL measurements at the first and third waves); these were averaged together to estimate the one-year stability of the IIDL items. Finally, the half-year and one-year stability estimates were very highly intercorrelated across the 58 IIDL items with both estimates (r = .87) and were thus averaged together to form the simple stability estimates.¹

IIDL means and standard deviations were available from six different time points (two assessments at each of the three waves). These estimates were averaged together, resulting in highly reliable estimates of the ordering of self-rating means (α > .99) and standard deviations (α > .98).

WFU Subject Pool-1 (WFUSP-1) sample. Participants from the same WFUSP-1 sample described earlier indicated the extent to which they perceived each item of the early 58-item version of the IIDL to be desirable on the same 1 to 5 scale used to rate the GBFM items. The IIDL items were presented in alphabetical order.

Online Survey (OS-1) sample. As some IIDL items examined here were not rated for desirability by the WFUSP-1 sample, visitors to the “Celebrity Similarity Test” Web site rated the extent to which they thought each item was desirable using a 1 (Extremely Undesirable) to 7 (Extremely Desirable) response scale, and ratings were included if raters rated at least 49 of the 51 items assessed; a total of 291 individuals were included by these criteria. The means and standard deviations of the desirability ratings were estimated for each IIDL item. To place these in the same scale as the desirability ratings obtained from the WFUSP-1 sample, we predicted the WFUSP-1 desirability means and standard deviations from the same coefficients obtained from the OS-1 sample (r = .99 for desirability means across samples, and .70 for matching desirability standard deviations, N = 38) and saved the predicted desirability means and standard deviations for the additional items not assessed by the WFUSP-1 sample (for the remaining 58 items, we simply used the WFUSP-1 values).
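The rescaling of OS-1 desirability values onto the WFUSP-1 metric can be sketched as follows. This is a minimal illustration that assumes a simple least-squares line fitted on the items rated in both samples; the function names and all numbers are hypothetical and are not taken from the article's data.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for predicting y from x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x_new):
    """Apply the fitted line to new predictor values."""
    return [intercept + slope * xi for xi in x_new]


# Hypothetical desirability means for items rated in BOTH samples.
os1_shared = [2.0, 3.5, 5.0, 6.5]     # OS-1 sample, 1-7 response scale
wfusp1_shared = [1.5, 2.5, 3.5, 4.5]  # WFUSP-1 sample, 1-5 response scale

slope, intercept = fit_line(os1_shared, wfusp1_shared)

# Hypothetical items rated only in OS-1, placed on the WFUSP-1 metric.
os1_only = [4.25, 6.0]
predicted_wfusp1 = predict(slope, intercept, os1_only)
```

The same two functions could be applied separately to desirability means and to desirability standard deviations, since the article reports a separate cross-sample correlation for each.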

1. The 13 IIDL items added to Waves 2 and 3 had half-year stability values, but they did not have one-year stability coefficients due to not being assessed at Wave 1. Half-year stability values tended to be somewhat higher than one-year stability values (.541 half-year vs. .495 one-year stabilities for the 58 IIDL items assessed in all three waves). To keep from artificially inflating the stability values of the 13 IIDL items due to their lack of a one-year stability estimate, we computed expected one-year stability scores for these items and then averaged their observed half-year stability values with their expected one-year stability values.


RESULTS

Given that the same analyses were conducted for the three separate inventories, we organize the results section by the analysis and then describe the results that were found for each inventory and across inventories. As our hypotheses regarding the sources of differential stability across trait items center on relationships between item means, variabilities, and desirabilities, and how these are associated with both artifactual and real influences on trait stability estimates, we first focus on documenting the associations between desirability and endorsement ratings of traits, and then focus on how these properties are associated with trait stability estimates.

Relationship Between Trait Endorsement Means (MX) and Desirability Means (MD)

Replicating previous research (Edwards, 1953), we found a very strong relationship between a trait’s mean and the trait’s mean desirability. The correlation between item endorsement means (MX) and desirability means (MD) across items ranged from .90 for the BFI and .91 for the GBFM to .96 for the IIDL, and was .94 across the three inventories when they were considered as a single sample with inventory controlled through dummy variables. These extremely strong relationships between item endorsements and desirabilities were found despite the fact that these estimates were collected in different samples, which should have served to lower estimates of this relationship.

Replicating a finding originally noted by Anderson (1968), there was a very clear bimodal distribution for the mean desirability of trait items (MD). That is, there was a strong tendency for a trait to be rated as either highly undesirable or highly desirable by most raters, and there were few traits rated as being about equally desirable and undesirable. Only 7 of the 155 total items were rated as about equally desirable and undesirable (POMP desirability means between 45 and 55). Four of these items involved aspects of Openness (BFI: prefers work that is routine; GBFM: simple; IIDL: traditional/conventional, radical/rebellious); the remaining three were the GBFM items emotional and reserved and the IIDL item feminine/unmasculine. Given the extremely high relationship between trait item desirabilities and endorsements, it is not surprising then that trait item means also showed a similar, although somewhat less stark, bimodal distribution.
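For readers unfamiliar with the POMP metric used above, the conversion can be sketched as follows. This assumes the standard percent-of-maximum-possible definition, 100 × (score − scale minimum) / (scale maximum − scale minimum); the helper names are hypothetical, and the 45 to 55 band is the one the text uses for "about equally desirable and undesirable."

```python
def pomp(score, scale_min, scale_max):
    """Percent-of-maximum-possible score: 0 at the scale minimum, 100 at the maximum."""
    return 100.0 * (score - scale_min) / (scale_max - scale_min)


def is_about_neutral(pomp_mean, low=45.0, high=55.0):
    """True when a POMP mean falls in the near-midpoint band described in the text."""
    return low <= pomp_mean <= high


# The midpoint of a 1-5 scale and of a 1-7 scale both map onto POMP = 50.
midpoint_5 = pomp(3, 1, 5)
midpoint_7 = pomp(4, 1, 7)
```

Putting ratings from differently sized response scales on this common 0 to 100 metric is what allows item means and standard deviations to be compared across the three inventories.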


Relationship Between Trait Item Endorsement Means (MX) and Standard Deviations (SDX)

We first estimated the relationship between trait means and trait standard deviations. As in an earlier investigation by Baird and colleagues (2006), to account for expected item variability reductions due to ceiling and floor effects, an item’s observed standard deviation was predicted from the item’s mean (MX) and mean-squared (MX²) in a regression. As expected, an item’s standard deviation was very highly associated with the item’s mean; the multiple correlation (R) from these regressions ranged from a high of .92 for the BFI to .83 for the IIDL and .88 for the GBFM (Table 1). Further, the mean-squared term was significant in all three regressions, suggesting that the relationship was significantly curvilinear. We then reparameterized these regression equations into this general form:

$$\text{Expected } SD_X = B^*_1 \,(M_X - B^*_2)^2 + B^*_3$$

This reparameterization was done as the B*2 terms provide an estimate of where we can expect the highest variability in an item as a function of the item’s mean. The equations for estimating the expected standard deviation of an item given the mean are given separately below for each inventory:

$$\text{GBFM: } -.008\,(M_X - 45.1)^2 + 26.8$$

$$\text{BFI: } -.007\,(M_X - 44.4)^2 + 28.7$$

$$\text{IIDL: } -.007\,(M_X - 43.6)^2 + 24.3$$
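The link between an ordinary quadratic regression, SD = b0 + b1·M + b2·M², and the reparameterized form B*1·(M − B*2)² + B*3 can be sketched as follows. The conversion is standard completing-the-square algebra; the function names are hypothetical, and the GBFM coefficients are used only as a round-trip illustration, not as a reproduction of the authors' fitting procedure.

```python
def to_vertex_form(b0, b1, b2):
    """Convert SD = b0 + b1*M + b2*M**2 into SD = B1*(M - B2)**2 + B3."""
    if b2 == 0:
        raise ValueError("No quadratic term, so the curve has no vertex.")
    B1 = b2                          # curvature (negative for an inverted U)
    B2 = -b1 / (2.0 * b2)            # mean at which expected SD peaks
    B3 = b0 - b1 ** 2 / (4.0 * b2)   # expected SD at that peak
    return B1, B2, B3


def expected_sd(mean, B1, B2, B3):
    """Expected item standard deviation at a given item mean (POMP units)."""
    return B1 * (mean - B2) ** 2 + B3


# Expand the reported GBFM equation into standard form, then convert it back.
B1, B2, B3 = -0.008, 45.1, 26.8
b2 = B1
b1 = -2.0 * B1 * B2
b0 = B1 * B2 ** 2 + B3
recovered = to_vertex_form(b0, b1, b2)

# Expected variability shrinks as the item mean moves away from the peak.
sd_at_peak = expected_sd(45.1, *recovered)
sd_near_ceiling = expected_sd(80.0, *recovered)
```

Written this way, B*2 can be read directly as the item mean at which variability is expected to be greatest, which is why the reparameterized form is convenient for the interpretation that follows.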

Interestingly, although we had anticipated that items would have the highest levels of variability if they had POMP means of 50 (i.e., if the mean was at the scale midpoint), for all inventories items were instead expected to have the highest levels of variability when the mean was at about 45% of the maximum scale range (i.e., endorsed on average slightly less than the neutral scale midpoint at POMP = 50). The negative B*1 values indicate that, as expected, item variabilities (SDX) decrease away from this value. A graph showing the curvilinear relationship between item means and standard deviations across the three inventories is given in Figure 3. This figure illustrates that, supporting Hypothesis 2, there was a very strong curvilinear relationship between an item’s mean and standard deviation, where trait items with means near the scale minimum or maximum (e.g., MX < 20 or MX > 80) had considerably smaller standard deviations.

Table 1
Relationships Between Means and Standard Deviations of Trait Ratings and Trait Desirability Ratings

| Inventory (Sample) | M (SD) | r with SDD | R with MX and MX² | r with SDD, controlling for MX and MX² |
| --- | --- | --- | --- | --- |
| Variability in endorsement, SDX | | | | |
| GBFM (UIUC-G) | 25.0 (3.7) | .26ns | .88a | .12ns |
| BFI (UIUC-D) | 23.2 (3.7) | .64 | .92a | .48 |
| IIDL (WFU-D) | 19.3 (3.5) | .57 | .83a | .54 |
| Variability in desirability, SDD | | | | |
| GBFM (WFUSP-1) | 19.0 (2.6) | — | .46a | — |
| BFI (WFUSP-2) | 22.1 (2.1) | — | .54a | — |
| IIDL (WFUSP-1) | 19.8 (2.9) | — | .52a | — |

Note. Subscript “ns” indicates association is not statistically significant (p > .05); all other correlations are statistically significant. Subscript “a” indicates that the value is a multiple R coefficient, where the variability coefficient (SDX or SDD) is predicted from the mean and mean-squared of trait endorsement ratings (MX and MX²) or trait desirability ratings (MD and MD²). Cells with value “—” indicate that the value could not be computed.

Relationship Between Means and Variabilities of Trait Item Endorsements and Desirabilities

We continued by exploring the relationships between the means and standard deviations of both standard trait ratings and trait desirability ratings. These correlations are given for each instrument in Table 1. Correlations reported in the text are statistically significant (p < .05) unless otherwise indicated.

Figure 3. Scatter plot displaying quadratic relationship between trait means and standard deviation across inventories. For each item, “G” indicates the item is from the GBFM, “B” indicates the item is from the BFI, and “I” indicates the item is from the IIDL.

We found that items with greater variabilities in desirability ratings (SDD) showed greater standard deviations in endorsement (SDX). The smallest association was found for the GBFM (r = .26, p = .11), and larger associations were found for the IIDL (r = .57) and the BFI (r = .64). The relationships between the SDX and SDD remained positive and significant even after controlling for item means and mean-squared across IIDL items (partial r = .54) and BFI items (partial r = .48), and across items of all three inventories considered together when additionally controlling for inventory (partial r = .42). This indicates that variabilities in trait desirability (SDD) and trait means (MX) show independent associations with how much traits vary in endorsement (SDX).
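A partial correlation of this kind can be sketched as the correlation between two sets of residuals, each obtained by regressing a variable on the item mean and squared item mean. The code below is a minimal standard-library illustration of that general technique; the function names and the item values are hypothetical and are not the article's data or necessarily the authors' exact procedure.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def residualize(y, covariates):
    """Residuals of y after least-squares regression on an intercept plus covariates.

    The intercept and covariates are orthogonalized one at a time (Gram-Schmidt),
    and y is then projected off each orthogonal column.
    """
    basis = []
    for column in [[1.0] * len(y)] + covariates:
        r = list(column)
        for b in basis:
            coef = _dot(r, b) / _dot(b, b)
            r = [ri - coef * bi for ri, bi in zip(r, b)]
        basis.append(r)
    res = list(y)
    for b in basis:
        coef = _dot(res, b) / _dot(b, b)
        res = [ri - coef * bi for ri, bi in zip(res, b)]
    return res


def partial_r(x, y, covariates):
    """Correlation between x and y after removing the covariates from both."""
    rx = residualize(x, covariates)
    ry = residualize(y, covariates)
    return _dot(rx, ry) / math.sqrt(_dot(rx, rx) * _dot(ry, ry))


# Hypothetical item-level values in POMP units, for illustration only.
item_means = [15.0, 30.0, 45.0, 60.0, 75.0, 90.0]
sd_endorse = [17.0, 24.0, 27.0, 25.0, 19.0, 12.0]   # SDX
sd_desir = [16.0, 20.0, 23.0, 21.0, 18.0, 15.0]     # SDD
controls = [item_means, [m ** 2 for m in item_means]]
r_partial = partial_r(sd_desir, sd_endorse, controls)
```

Controlling for inventory in the pooled analysis would, under the same approach, amount to adding dummy-coded inventory columns to the list of covariates.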

Levels of Trait Simple Stability, Dependability, and Systematic Stability, and Their Interrelationships

To this point, we have focused on documenting the relationships between simple properties of trait items (see Figure 1). We continue by examining our central hypotheses about how these properties relate to simple stability (s), dependability (d), and systematic stability (S*) estimates of trait items. As dependability estimates were not available for the GBFM items, a full examination of these relationships is only possible for the BFI and IIDL instruments.

First, we found that the trait items showed a similar amount of average simple stability across the three instruments, although the GBFM items showed somewhat lower simple stability estimates (avg. r = .43) than the BFI and IIDL items (avg. r = .52). These differences could be due to the fact that the GBFM items consisted of simpler items (i.e., single adjectives) or were assessed over a somewhat longer period (i.e., the GBFM simple stabilities were based on a 12-month retest interval, whereas the BFI items were based on the average estimates from 3-month and 8-month retest intervals, and the IIDL items were based on the average estimates from 6-month and 12-month retest intervals). As expected, both the BFI and IIDL items showed dependabilities that were somewhat higher than their simple stabilities (avg. r = .61). This indicated that there were changes in how at least some individuals saw themselves on the BFI and IIDL items even over a period of about a year, as evidenced by the fact that the item systematic stabilities observed for the BFI and IIDL items were frequently below 1 (avg. r = .85).2

For both the BFI and IIDL, item dependabilities tended to be very highly associated with item stabilities (BFI: r = .70; IIDL: r = .86). This high relationship was particularly interesting for the BFI, given that the BFI item dependability and stability estimates were based on data collected from different samples. Also, for both instruments, items with the highest simple stability estimates also tended to have higher systematic stability estimates (BFI: r = .65; IIDL: r = .49); this should not be particularly surprising given that the stability estimates form the numerator of the equation used to estimate systematic stability (S* = s/d; Equation 2), although these correlations are sufficiently modest to indicate that there are considerable differences in the ordering of which items show highest simple stabilities and highest systematic stabilities. Finally, item dependabilities showed no association with item systematic stabilities (BFI: r = -.02; IIDL: r = -.03), indicating the independent nature of these types of stability estimates. That is, items with more transient error are not expected to have higher or lower levels of systematic stability over time.
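As a worked illustration of S* = s/d, the sketch below computes systematic stability for the two items whose values the article reports in its footnote on this estimate (s = .54, d = .55 for the IIDL item confident/self-assured; s = .56, d = .53 for the BFI item has few artistic interests). The function names are hypothetical.

```python
def systematic_stability(s, d):
    """Systematic stability S* = s / d (simple stability over dependability)."""
    if d <= 0:
        raise ValueError("Dependability must be positive.")
    return s / d


def exceeds_one(s, d):
    """Flag estimates above 1.0, which the article treats as a sign that s or d is off."""
    return systematic_stability(s, d) > 1.0


# Item values as reported in the article's footnote on systematic stability.
iidl_confident = systematic_stability(0.54, 0.55)
bfi_few_artistic = systematic_stability(0.56, 0.53)
```

Because d sits in the denominator, an item with low dependability can have a modest simple stability and still a high systematic stability, which is why the two orderings of items need not agree.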

Sources of Trait Item Differences in Stability, Dependability, and Systematic Stability

Our central hypotheses concerned how different properties of trait items are associated with simple stability estimates, dependability estimates, and systematic stability estimates. We continue by examining how levels of these stability estimates may differ by Big Five domain, and then how they may differ as a function of item means (MX) and item variabilities in desirabilities (SDD).

2. Systematic stability coefficients (S* = s/d) are not expected to exceed 1.0, but they may if an item’s observed stability estimate (s) is artificially low or if the item’s observed dependability estimate (d) is artificially high. Consistent with this expectation, no IIDL items showed systematic stability estimates at or above 1.0 (max S* = .54/.55 = .98 for the item confident/self-assured). Only one BFI item showed a systematic stability above 1.0 (the item has few artistic interests; S* = .56/.53 = 1.06). The low incidence of systematic stabilities estimated above 1.0 for the BFI items was somewhat surprising given that the dependability and stability estimates for these items were obtained from two different samples (the UI-S and UIUC-D samples).


Test of differences in stability coefficients by Big Five domain. We began with a test of whether there were differences in item simple stability coefficients that could be predicted across Big Five domains; this was investigated more formally through an F test for differences across Big Five domains. The GBFM and BFI items were classified by design into Big Five domains; to classify the IIDL items, we used the item classifications given in Wood, Harms et al. (2010, Table 1), and for the additional items not considered in that article, we also included behavioral traits that could be easily placed into Big Five categories (e.g., messy/sloppy as a conscientiousness item, satisfied/secure as an emotional stability item); 54 of the 71 IIDL items were classified in this manner.

We found significant differences in simple stability estimates as a function of Big Five domain for all three inventories, BFI: F(4,39) = 6.59, GBFM: F(4,34) = 7.53, IIDL: F(4,49) = 2.61, all ps < .05, and across Big Five dimensions considered across all inventories while controlling for inventory, F(4,130) = 19.15, p < .05. As shown in Table 2, for all three instruments, items in the domain of agreeableness had the lowest average simple stability, and items in the domain of extraversion were near the highest, with more inconsistent results for items in the other trait domains.
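The within-inventory tests reported above have the form of a one-way analysis of variance, with items as observations and Big Five domain as the grouping factor. A minimal standard-library sketch of that F ratio follows; the function name and the stability values are hypothetical, and the sketch does not cover the pooled analysis that additionally controls for inventory.

```python
def one_way_f(groups):
    """One-way ANOVA F statistic with its between- and within-group df."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    ss_within = sum(
        sum((x - m) ** 2 for x in g) for g, m in zip(groups, group_means)
    )
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical item simple stabilities grouped by domain (E, A, C, ES, O/I).
stabilities_by_domain = [
    [0.60, 0.62, 0.58],
    [0.44, 0.47, 0.41],
    [0.50, 0.46, 0.49],
    [0.52, 0.48, 0.51],
    [0.57, 0.55, 0.59],
]
f_stat, df1, df2 = one_way_f(stabilities_by_domain)
```

With five domains the numerator df is 4, and the denominator df is the number of classified items minus 5, which matches the pattern of the F(4,39), F(4,34), and F(4,49) values reported for the individual inventories.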

We found indications that the differences in simple stability estimates across Big Five domains may reflect both differential dependability and differential systematic stability as a function of Big Five domain. Chmielewski and Watson (2009) reported apparent differences in dependabilities across Big Five domains, with extraversion being the most dependable and agreeableness being the least dependable; we evaluated this more formally at the item level of analysis and found there to be nearly significant differences in dependabilities across Big Five domains using their BFI dependability data, F(4,39) = 2.58, p = .052, and also significant dependability differences across Big Five domains for IIDL items, F(4,49) = 3.48, p < .05. There were also significant differences in dependability estimates by Big Five dimensions across the items of both inventories considered together, controlling for inventory, F(4,92) = 5.09, p < .05. Importantly, both inventories showed the same general pattern, with extraversion, emotional stability, and Openness items showing higher dependabilities than agreeableness items. As this largely paralleled the nature of domain differences in simple stability estimates, this indicates that domain differences in simple stability estimates reflect, in part, differences in the amount of transient error across Big Five domains (e.g., agreeableness items have more transient error than extraversion items).

Table 2
Tests of Differences in Variability and Stability of Trait Items by Big Five Domain

| Inventory (Sample) | E | A | C | ES | O/I | F Test |
| --- | --- | --- | --- | --- | --- | --- |
| Simple stability, s | | | | | | |
| GBFM (UIUC-G) | .51 (.09) | .30 (.07) | .48 (.11) | .47 (.07) | .41 (.08) | F(4,34) = 7.53 |
| BFI (UIUC-D) | .61 (.03) | .45 (.08) | .48 (.07) | .50 (.07) | .56 (.08) | F(4,39) = 6.59 |
| IIDL (WFU-D) | .54 (.09) | .48 (.05) | .54 (.10) | .48 (.04) | .55 (.07) | F(4,49) = 2.61 |
| Dependability, d | | | | | | |
| BFI (Iowa-SP) | .67 (.05) | .57 (.06) | .58 (.08) | .63 (.05) | .62 (.10) | F(4,39) = 2.58ns |
| IIDL (WFU-D) | .65 (.06) | .57 (.05) | .63 (.10) | .61 (.04) | .65 (.06) | F(4,49) = 3.48 |
| Systematic stability, s/d | | | | | | |
| BFI | .91 (.04) | .79 (.13) | .83 (.06) | .80 (.12) | .90 (.06) | F(4,39) = 3.39 |
| IIDL | .83 (.06) | .85 (.07) | .86 (.07) | .80 (.07) | .85 (.07) | F(4,49) = 1.56ns |

Note. Cells show means and standard deviations of stability estimates within the trait domain. Subscript “ns” indicates stability estimates do not differ across Big Five domains (p > .05). E = extraversion, A = agreeableness, C = conscientiousness, ES = emotional stability, O/I = Openness/Intellect.

However, there was also some evidence that there were differences in item systematic stability coefficients that could be predicted by Big Five domain for the BFI, F(4,39) = 3.39, p < .05, and IIDL, F(4,61) = 1.56, p = .20, and across the items of both inventories considered together, controlling for inventory, F(4,92) = 2.90, p < .05. The results indicate that when measured without transient error, Openness items show slightly higher systematic stabilities than emotional stability items, but these differences were small in magnitude.

Relationships between item means and desirabilities with simple stability estimates. The relationships between mean item endorsements (MX), item variabilities in endorsement (SDX), and item variabilities in desirability (SDD) with the three types of item stability estimates (simple stability, dependability, and systematic stability) are given in Table 3.

Supporting Hypotheses 1 and 2, items that had higher standard deviations, or that had means near the scale midpoint, had higher simple stability estimates (rs ≥ .55). Supporting Hypothesis 3, items that varied more in how much individuals saw them as desirable were significantly more stable in all inventories (rs ≥ .34, p < .05).

Again, the idea that item simple stability estimates are associated with the extent to which traits varied in their desirability was of particular interest, so this relationship was investigated further by creating a scatter plot depicting how an item’s variability in desirability (SDD) was associated with the item’s simple stability (s) across items (see Figure 4). As item simple stabilities and desirability variabilities were estimated using data from different samples across the instruments, simple stability and desirability variability estimates were standardized within instrument and then transformed into the same units as observed among IIDL items in order to place the estimates from the different inventories on a comparable metric.
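The rescaling described above can be sketched as a two-step transformation: standardize each estimate within its own instrument, then re-express the z scores using the IIDL mean and standard deviation. This is a minimal sketch under that reading; the function name and the three GBFM values are hypothetical, the IIDL reference values (.52, .08) are the simple stability mean and standard deviation reported in Table 3, and the use of the sample standard deviation is an assumption.

```python
import statistics


def rescale_to_reference(values, ref_mean, ref_sd):
    """Standardize values within their own set, then map them onto a reference metric."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD; the article does not say which was used
    return [ref_mean + ref_sd * (v - mean) / sd for v in values]


# Hypothetical simple stabilities for three GBFM items.
gbfm_stabilities = [0.32, 0.43, 0.54]

# Re-expressed in IIDL units (IIDL simple stability M = .52, SD = .08).
rescaled = rescale_to_reference(gbfm_stabilities, ref_mean=0.52, ref_sd=0.08)
```

The transformation preserves each item's standing within its own instrument while removing between-instrument differences in level and spread, so items from all three inventories can share one set of axes.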

The strong relationship between an item’s variability in desirability (SDD) and the item’s simple stability (s) can be seen in Figure 4. A closer inspection of Figure 4 reveals that there were impressive regularities in the type of content that was empirically observed to be most stable and to vary most in desirability across instruments. Some of these findings were consistent with the Big Five domain differences presented in Table 2, although there were also indications of a more nuanced picture of stability differences across content even considered to be within a given Big Five trait domain. The most stable items and the items with the highest variability in rated desirability tended to involve various aspects of extraversion (e.g., BFI: has an assertive personality, is talkative; GBFM: introverted, reserved, talkative; IIDL: controlling/dominant, bashful/shy, outgoing/extraverted), conventionality and artistic aspects of openness to experience (e.g., BFI: is sophisticated in art/music/literature; GBFM: creative; IIDL: creative/artistic, traditional/conventional, strange/weird), and organization aspects of conscientiousness (e.g., BFI: tends to be disorganized; GBFM: organized, neat; IIDL: messy/sloppy, organized/efficient). In contrast, the content that was observed to be least stable and that varied least in rated desirability tended to indicate aspects of agreeableness (e.g., BFI: is helpful and unselfish to others, likes to cooperate with others; GBFM: kind, distrustful, sympathetic, cooperative; IIDL: pleasant/agreeable, cruel/abusive, unsympathetic/unfriendly, crabby/grouchy), emotional stability (e.g., BFI: remains calm in tense situations, can be tense; GBFM: anxious, fretful; IIDL: unstable/disturbed, satisfied/secure), intelligence (e.g., GBFM: intellectual, bright; IIDL: dumb/stupid, skilled/talented), and reliability aspects of conscientiousness (BFI: does things efficiently, is a reliable worker, makes plans and follows through with them; IIDL: unreliable/undependable, dependable/reliable).

Table 3
Relationships Between Trait Item Means (MX), Endorsement Variabilities (SDX), and Desirability Variabilities (SDD) With Trait Stability Estimates

| Inventory (Sample) | M (SD) | R with MX and MX² | r with SDX | r with SDD |
| --- | --- | --- | --- | --- |
| Simple stability, s | | | | |
| GBFM (UIUC-G) | .43 (.11) | .66a | .71 | .34 |
| BFI (UIUC-D) | .52 (.09) | .66a | .74 | .63 |
| IIDL (WFU-D) | .52 (.08) | .55a | .66 | .58 |
| Dependability, d | | | | |
| BFI (UI-S) | .61 (.08) | .67a | .73 | .43 |
| IIDL (WFU-D) | .62 (.08) | .55a | .66 | .49 |
| Systematic stability, s/d | | | | |
| BFI | .85 (.10) | .35ns,a | .30 | .45 |
| IIDL | .84 (.07) | .18ns,a | .16ns | .30 |

Note. Subscript “a” indicates that the column “MX and MX²” shows the multiple correlation (R) between stability or agreement as predicted by the trait mean and squared trait mean in a simultaneous regression; these are controlled in later columns.

Figure 4. Scatter plot displaying relationship between variability in trait desirability and trait simple stability estimates across instruments, where simple stability (s) and variability in desirability (SDD) estimates have been scaled to be in the same units as the levels observed in the IIDL instrument, as assessed in the WFU-D sample (r = .53). For each item, “G” indicates the item is from the GBFM, “B” indicates the item is from the BFI, and “I” indicates the item is from the IIDL.

As noted earlier, there were extremely high relationships between item desirability means (MD) and endorsement means (MX). It is thus possible that the relationship between item variabilities in desirability (SDD) and simple stabilities (s) is driven entirely by restriction of range artifacts for very desirable or undesirable traits. We explored this by examining the relationships between means and standard deviations of item desirabilities, controlling for item means. However, even after controlling for the quadratic effect of item means on stability, a sizable relationship remained between item variabilities in desirability and item simple stabilities for the BFI (partial r = .44) and IIDL (partial r = .41), but not the GBFM (partial r = .17, p = .32), and the effect was observed across the three inventories considered simultaneously, controlling for inventory (partial r = .32).

Associations between trait means with all three stability estimates. As we have noted, our expectation is that the means of trait items relate to stability estimates primarily through a restriction of range artifact, rather than indicating which traits truly show greater stability. It is possible to shed additional light on this explanation of the association between trait means and stability coefficients by examining how trait means are associated with trait dependability and systematic stability coefficients. As noted earlier, dependabilities (and consequently, systematic stabilities; Equation 2) were available for the BFI and IIDL instruments, making it possible to examine these relationships for these two instruments. We report the multiple correlation (R) between an item’s mean and mean-square in predicting stability coefficients to show the association predicted through item means.

For the BFI, item means were as associated with item dependabilities (R = .67) as with item simple stabilities (R = .66). However, consistent with our expectations (Hypothesis 4a), item means were not significantly associated with item systematic stability coefficients (R = .35, p = .07). We found the same pattern for the IIDL: item means were as associated with trait dependabilities (R = .55) as with item stabilities (R = .55) and were not significantly associated with item systematic stability coefficients (R = .18, p = .31). These findings are consistent with our interpretation that trait items with very high or very low means have lower stability coefficients mainly due to being infused with more transient error, rather than indicating that these traits have truly less stable rank orderings of how individuals systematically see themselves over time.

Associations between trait variability in desirability with all three stability estimates. Again, we expected that traits that vary more in their perceived desirability will be more stable than others. If individuals direct their developmental energies toward developing a trait in the direction they perceive as desirable, then traits with more varying desirabilities will have people actively trying to maintain their trait levels, which will serve to increase stability in the ordering of individuals on the trait dimension over time. The validity of this interpretation can also be explored by examining how trait variabilities in desirability are associated with all three types of stability coefficients.

Within the BFI, item variabilities in desirability (SDD) were associated somewhat less with trait dependabilities (r = .43) than with simple stability coefficients (r = .63). However, consistent with our expectations (Hypothesis 4b), item variabilities in desirability were significantly associated with item systematic stabilities (r = .45). Again, we found a nearly identical pattern with the IIDL: item variabilities in desirability were associated somewhat less with item dependabilities (r = .49) than with simple stabilities (r = .58). However, item variabilities in desirability were significantly associated with item systematic stabilities (r = .30). Across the two inventories considered together when controlling for inventory, item variabilities in desirability were associated with item systematic stabilities (partial r = .36) and remained significant after additionally controlling for the item’s Big Five classification (partial r = .34), indicating that even within a particular Big Five domain, items varying more in their desirability had higher systematic stabilities in the rank ordering of individuals over time.

Given our interest in the relationship between variability in trait desirability and trait systematic stability, this relationship was investigated further by creating a scatter plot depicting how variation in trait desirability was associated with trait systematic stability across items (see Figure 5); again, the systematic stabilities for the BFI items were put in the units of those found for the IIDL items. As with simple stability coefficients, there appeared to be some general tendencies in the types of content observed to be most and least systematically stable. Many of the traits with the highest systematic stability estimates (S*) involved aspects of extraversion related to social agency and prominence (BFI: has an assertive personality; IIDL: controlling/dominant, prominent/well-known) and energy (BFI: generates a lot of enthusiasm, is full of energy; IIDL: exciting/fascinating, outgoing/extraverted); and aspects of Openness related to artistic tendencies (BFI: has few artistic interests, values artistic/aesthetic experiences; IIDL: creative/artistic) and conventionality (IIDL: weird/strange, radical/rebellious, ordinary/average). Items indicating degree of artistic and traditional tendencies in particular tended fairly consistently both to vary more in their desirability and to show higher systematic stability over time.

Figure 5. Scatter plot displaying relationship between variability in trait desirability and trait systematic stability across BFI and IIDL instruments, where systematic stability (S*) and variability in desirability (SDD) estimates have been scaled to be in the same units as the levels observed in the IIDL instrument, as assessed in the WFU-D sample (r = .36). For each item, “B” indicates the item is from the BFI, and “I” indicates the item is from the IIDL.

In contrast, trait items with the lowest systematic stability estimates tended to be aspects of agreeableness related to antisocial feelings and behavior (BFI: starts quarrels with others, tends to find fault with others; IIDL: angry/hostile, crabby/grouchy) and thoughtfulness (BFI: is helpful and unselfish with others; IIDL: affectionate/passionate, giving/generous), as well as items related to emotional stability and anxiety (BFI: is emotionally stable/not easily upset, is relaxed/handles stress well, remains calm in tense situations; IIDL: stable/well-adjusted, satisfied/secure, unstable/disturbed, afraid/scared) and low positive affectivity (BFI: is depressed/blue; IIDL: sad/unhappy, lonely/lonesome). Items for all of these traits tended fairly consistently both to vary less in their perceived desirability and to show lower systematic stability over time.

DISCUSSION

In the current research, we explored properties of traits and trait items that may influence trait stability estimates. First, we found that stability was higher among trait items with greater standard deviations (Hypothesis 1) and was lower for trait items with means near the scale’s maximum or minimum possible value (Hypothesis 2). We expected that differences in item means would lead to differences in simple stability estimates through an artifact related to restriction of range, and consistent with this, we found that item means were particularly associated with item dependabilities, but were largely unassociated with item systematic stabilities (Hypothesis 4a). In contrast, we also found traits that varied more in their perceived desirability across individuals to show higher simple stabilities over time (Hypothesis 3). We expected that traits varying more in their perceived desirability would be traits with truly more stable orderings of individuals over time, and consistent with this, we found that unlike trait means, traits that varied more in their desirability were associated with higher systematic stability estimates (Hypothesis 4b).

As noted earlier, there has been little agreement about which personality traits are more stable than others, or even whether levels of stability actually vary across trait dimensions (e.g., Roberts & DelVecchio, 2000). This study provides powerful support for the argument that trait differences in stability are both real and replicable, and it considerably refines our understanding of what these differences are and why they arise. First, we were able to replicate the general contours of stability differences across Big Five domains that had been outlined in prior investigations across three different inventories and samples, with items in the domain of extraversion consistently showing higher stabilities and items in the domains of agreeableness and emotional stability showing lower stabilities than items in other Big Five domains (e.g., Roberts & DelVecchio, 2000; Vaidya et al., 2008). Although more exploratory, we also found indications of differential stability even among traits considered to be within the same Big Five domain. For instance, in the domain of Openness, items related to traditionalism and artistic tendencies appeared to have higher stabilities than items related to intellect; in the domain of extraversion, items related to assertiveness and energy appeared to be more stable than items related to positive affect; in the domain of agreeableness, items related to anger and hostility consistently showed the lowest stabilities; and in the domain of conscientiousness, items related to organization appeared to be more stable than items related to reliability. Although these finer distinctions remain to be replicated in future research, they are generally consistent with our expectation that traits varying more in their desirability will be more stable (Hypotheses 3 and 4b). It is important to note also that this general hypothesis was consistently strongly supported across all three instruments, despite the fact that the critical tests were conducted using item stability and desirability estimates that were invariably collected in different samples. This increases our confidence that the general contours of which traits we found here to be most and least stable over time are general properties of the traits that will generalize to other samples.


Identifying Artifactual and Real Influences on Trait Stability Estimates

There are clear psychometric reasons to suspect that traits with greater variability will have greater stability (and indeed should correlate more with any variable) due to the role of a variable’s range in influencing the expected magnitude of correlations (Cohen et al., 2003). Although we documented the considerable extent to which differences in variability across trait estimates are associated with trait differences in estimated stability, cross-trait comparisons in the level of variability are almost never examined as an empirical question. We suspect this is due to the difficult questions it raises in shifting from psychometric to substantive interpretations of these relationships. For instance, when would we be comfortable saying that one trait varies more than another? Although we can empirically evaluate whether some traits vary more than others by forcing them onto a standard agree/disagree response scale as we have done here (e.g., how much do you agree that you are nice? talkative? smart?), this presents issues of its own. Most important to the present investigation, there is the issue of the operation of ceiling and floor effects. As we show here, differences in variability driven by differences in rates of endorsement can cause us to conclude that certain dimensions vary less and are less stable than others due to the fact that people at the high or low ends of the dimension are not able to discriminate themselves in a way they might be able to if a broader range of scale points were available. We show that items with means near the scale maximum or minimum have considerably lower stability estimates than items with means near the scale midpoint, but this seems to occur mainly by decreasing the measure’s dependability; measures with such means do not seem to actually be less stable after accounting for measurement dependability (see Table 3). As trait differences in dependability estimates closely track trait differences in simple stability estimates, it is clear that this artifact should be accounted for in studies of differential stability, as doing so alters the picture of which traits are most and least stable over time, as can be seen by comparing Figures 4 and 5.
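
The ceiling-effect argument can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the analysis reported in this article: the 1 to 5 response scale, the normally distributed latent trait, the size of the transient error, and all function names are hypothetical choices made for illustration. The latent trait is held perfectly constant between the two occasions, so any difference in retest correlation between the two items reflects only where the item mean sits on the response scale.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def respond(latent, item_mean, error_sd, rng, lo=1, hi=5):
    """One rating: latent standing plus transient error, rounded and
    clipped to the response scale (assumed here to run from 1 to 5)."""
    raw = item_mean + latent + rng.gauss(0.0, error_sd)
    return min(hi, max(lo, int(round(raw))))


def retest_correlation(item_mean, n=10000, error_sd=0.5, seed=0):
    """Retest correlation for a simulated item whose latent trait is
    identical at both occasions (no true change by construction)."""
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in range(n)]
    time1 = [respond(t, item_mean, error_sd, rng) for t in latent]
    time2 = [respond(t, item_mean, error_sd, rng) for t in latent]
    return pearson(time1, time2)


if __name__ == "__main__":
    # Hypothetical items: one centered on the scale midpoint,
    # one with a mean close to the scale maximum.
    print("midpoint item:    ", round(retest_correlation(3.0), 2))
    print("near-ceiling item:", round(retest_correlation(4.8), 2))
```

Under these assumptions the near-ceiling item yields the lower retest correlation even though the underlying trait is equally stable for both items, which is the pattern the dependability correction is meant to remove.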

In contrast, the finding that traits varying more in their desirability are more stable over time is supportive of the broader suggestion that a person’s trait level is in part motivated, and that people actively strive (and have some success) in developing and maintaining the characteristics they perceive to be desirable (Hogan & Roberts, 2004; Roberts et al., 2005, 2008; Wood et al., 2007). From this perspective, the heightened observed stability of individual differences in traits related to traditionalism, dominance, masculinity, and creativity is thought to be driven by the fact that many individuals view these traits as being particularly desirable and thus take additional efforts to increase or maintain their high levels of these traits, but many other individuals view these characteristics as neutrally desirable or even undesirable, and consequently either do not take such efforts or actively attempt to decrease or maintain their low levels of the traits. In contrast, the lower stability of terms related to hostility, anxiety, reliability, and sadness could be due to the fact that people almost consensually agree in perceiving these traits as being either highly desirable or highly undesirable, and thus the same mechanism of an individual effortfully developing traits he or she sees as desirable will direct almost all people to develop their trait levels toward a single pole of the trait dimension. This, in turn, should result in smaller variation across people over the potential range of trait-related behavior, and also a greater likelihood that some individuals will “leapfrog” others who previously exceeded their own trait levels due to differential success in their self-developmental efforts.

Future Directions

There is an interesting flip side to our suggestion that traits that are less consensually desired should have increased trait stability due to individuals effortfully developing their characteristics in the direction they find desirable. When almost everyone views a trait to be desirable, this should serve as a substantial force pushing for increasing levels of the trait with age across the entire population, as almost everyone desires to change his or her level of the trait in the same direction. Existing research on normative developmental patterns over time seems to support this point, as the traits that increase the most over time actually do appear to be those that are most consensually desired and least stable from our estimates, with traits related to agreeableness and emotional stability showing greater mean-level increases over time than traits related to extraversion and Openness (Roberts, Walton, & Viechtbauer, 2006; Srivastava, John, Gosling, & Potter, 2003; Wood et al., 2007). Indeed, emerging evidence is consistent with the idea that this may even be true at the facet level, where there appear to be greater mean-level changes for the more consensually desired reliability aspects of conscientiousness than for the less consensually desired organization aspects (Jackson et al., 2009; Soto, John, Gosling, & Potter, 2011). This indicates just one of a number of ways in which the role of trait desirabilities could be explored more generally in studies of personality stability and change.

This study also suggests ways in which studies of personality stability should be conducted more generally. In particular, our results strongly point to the importance of accounting for the role of transient measurement error through dependability estimates when examining differences in stability across measures (Chmielewski & Watson, 2009; Watson, 2004). In this investigation, correcting for transient error through dependability estimates allowed us to increase confidence in our central hypotheses that trait item means are highly associated with trait differences in stability estimates through a restriction of range artifact rather than reflecting “real” differences in trait stabilities, whereas traits varying more in desirability are actually more stable over time through the motivated maintenance of trait levels. Although dependability estimates are rarely collected due to requiring the resurveying of a sample over an unusually short time interval, we believe they are crucial to collect in studies examining the stability of a measure or trait for at least two reasons. First, dependability estimates allow for the correction of transient error at the level of single items, as done in the present investigation, which is simply not possible through the use of the standard Cronbach’s alpha statistic (as the standard alpha statistic is based on inter-item correlations). This is an extremely important advantage, as studies of differences in trait properties across traits are best served by surveying as broad a range of traits as possible (e.g., Funder & Dobroth, 1987). Second, although seldom used, dependability estimates are probably the conceptually appropriate correction for measurement unreliability in studies of trait stability anyway. Since measurement stability is estimated by correlating scores of the same measure some time apart, the important type of unreliability to be accounted for is inconsistency in responses to the same measure when the time interval is so short that inconsistencies should represent negligible “true” change in the characteristic (Cattell et al., 1970; Chmielewski & Watson, 2009; McCrae et al., 2011; Watson, 2004).
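
The item-level correction described above can be sketched in a few lines. This is an illustrative sketch only, assuming the standard correction for attenuation with equal dependability at both occasions, in which the denominator reduces to the dependability estimate itself; the function name and the example values are hypothetical and are not taken from this study.

```python
def systematic_stability(simple_stability, dependability):
    """Correct a long-interval retest correlation for transient error.

    Assumes the classical correction for attenuation with the same
    short-interval dependability at both occasions, so that
    sqrt(dependability * dependability) == dependability.
    """
    if not 0.0 < dependability <= 1.0:
        raise ValueError("dependability must be in the interval (0, 1]")
    return simple_stability / dependability


if __name__ == "__main__":
    # Hypothetical items given as (simple stability, dependability).
    # The second item looks less stable only because it is measured
    # less dependably.
    items = {
        "item near scale midpoint": (0.60, 0.80),
        "item near scale maximum": (0.45, 0.60),
    }
    for name, (stability, dependability) in items.items():
        corrected = systematic_stability(stability, dependability)
        print(name, round(corrected, 2))
```

In this made-up example the two items differ in simple stability but have essentially the same corrected value, which mirrors the argument that differences tied to item means largely disappear once dependability is taken into account. Because the correction uses only an item's own retest correlations, it can be applied to single items, unlike corrections based on inter-item correlations.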

Our understanding of which traits are more stable than others and why is in its infancy. Naturally, this study has certain limitations that should be addressed in future studies of differential stability. For instance, despite investigating these hypotheses with several distinct samples of participants, nearly all samples used here were college students, and the longest test-retest interval was only a year. The picture of which traits are most and least stable could differ appreciably in studies examining older samples and over longer periods of time, particularly as there are some indications that age may be associated with changes in the degree to which some traits are seen as desirable (e.g., Wood et al., 2007). Additionally, although this study examined a broader range of traits than most studies of trait stability (e.g., not just the principal axes of the Big Five), future studies of trait stability should continue to expand the range of traits that are examined. Finally, this study is unique in that item dependability, desirability, and stability estimates from a single instrument were generally obtained from different samples. Although this feature should increase confidence that the documented relationships reflect real properties of traits and trait items rather than simply idiosyncratic or problematic characteristics of one sample, this feature also precludes some important tests that could be explored if all estimates were obtained from a single sample. For instance, such a design would allow a test of the surprisingly underexamined assumption underlying the current study that what a particular individual finds desirable is associated with the individual’s own trajectory of personality development. The ability to test this hypothesis will greatly improve our understanding of the processes underlying the current findings and personality development more generally.

The current study advances our understanding of trait differences in stability in several ways. First, the current research provides strong indications that individual differences in some personality traits are indeed more stable over time than others by showing that individual differences in self-reports of content associated with extraversion and traditionalism are regularly more stable than self-reports of content associated with agreeableness, emotional stability, or reliability. Second, this research indicates that differences across trait items in simple stability estimates are likely produced in part by an artifact associated with ceiling or floor effects that regularly accompany the assessment of certain traits, which can be accounted for by correcting for a measure’s level of dependability. Finally, our findings also indicate that trait differences in stability estimates are likely produced by differences in the extent to which the traits are consensually desired, with less consensually desired traits being more stable over time. Unlike traits with means near the scale midpoint, traits varying more in desirability seem to truly have more stable rank orderings of individuals over time, and this relationship persists even after accounting for the role of transient error on stability estimates. Although further research needs to be conducted, this final finding provides compelling evidence that traits do in fact vary in their rank-order stabilities over time, and that these differences can be understood by appreciating the role of individuals actively developing their own trait levels in the directions they perceive to be desirable.

REFERENCES

Anderson, N. (1968). Likableness ratings of 555 personality-trait words. Journal of Personality and Social Psychology, 9, 272–279.

Baird, B., Le, K., & Lucas, R. (2006). On the nature of intraindividual personality variability: Reliability, validity, and associations with well-being. Journal of Personality and Social Psychology, 90, 512–527.

Baltes, P. B. (1997). On the incomplete architecture of human ontogeny: Selection, optimization, and compensation as foundation of developmental theory. American Psychologist, 52, 366–380.

Caspi, A., Roberts, B., & Shiner, R. (2005). Personality development: Stability and change. Annual Review of Psychology, 56, 453–484.

Cattell, R. B., Eber, H. W., & Tatsuoka, M. M. (1970). Handbook for the Sixteen Personality Factor Questionnaire (16PF). Champaign, IL: Institute for Personality and Ability Testing.

Chmielewski, M., & Watson, D. (2009). What is being assessed and why it matters: The impact of transient error on trait research. Journal of Personality and Social Psychology, 97, 186–202.

Cohen, P., Cohen, J., Aiken, L. S., & West, S. G. (1999). The problem of units and the circumstances for POMP. Multivariate Behavioral Research, 34, 315–346.

Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Erlbaum.

Conley, J. J. (1984). The hierarchy of consistency: A review and model of longitudinal findings on adult individual differences in intelligence, personality and self-opinion. Personality and Individual Differences, 5, 11–25.

Costa, P. T., Jr., McCrae, R. R., & Arenberg, D. (1980). Enduring dispositions in adult males. Journal of Personality and Social Psychology, 38, 793–800.


Edwards, A. L. (1953). The relationship between the judged desirability of a trait and the probability that the trait will be endorsed. Journal of Applied Psychology, 37, 90–93.

Eid, M., & Diener, E. (1999). Intraindividual variability in affect: Reliability, validity, and personality correlates. Journal of Personality and Social Psychology, 76, 662–676.

Ferguson, C. J. (2010). A meta-analysis of normal and disordered personality across the life span. Journal of Personality and Social Psychology, 98, 659–667.

Fleeson, W. (2007). Situation-based contingencies underlying trait-content manifestation in behavior. Journal of Personality, 75, 825–862.

Fujita, F., & Diener, E. (2005). Life satisfaction set point: Stability and change. Journal of Personality and Social Psychology, 88, 158–164.

Funder, D. C. (1995). On the accuracy of personality judgment: A realistic approach. Psychological Review, 102, 652–670.

Funder, D., & Dobroth, K. (1987). Differences between traits: Properties associated with interjudge agreement. Journal of Personality and Social Psychology, 52, 409–418.

Goldberg, L. R. (1992). The development of markers for the Big-Five factor structure. Psychological Assessment, 4, 26–42.

Hampson, S. E., & Goldberg, L. R. (2006). A first large cohort study of personality trait stability over the 40 years between elementary school and midlife. Journal of Personality and Social Psychology, 91, 763–779.

Hogan, R., & Roberts, B. W. (2004). A socioanalytic model of maturity. Journal of Career Assessment, 12, 207–217.

Jackson, J., Bogg, T., Walton, K., Wood, D., Harms, P., Lodi-Smith, J., et al. (2009). Not all conscientiousness scales change alike: A multimethod, multisample study of age differences in the facets of conscientiousness. Journal of Personality and Social Psychology, 96, 446–459.

John, O. P., & Robins, R. W. (1993). Determinants of interjudge agreement on personality traits: The Big Five domains, observability, evaluativeness, and the unique perspective of the self. Journal of Personality, 61, 521–551.

John, O. P., & Srivastava, S. (1999). The Big Five trait taxonomy: History, measurement, and theoretical perspectives. In L. A. Pervin & O. P. John (Eds.), Handbook of personality (pp. 102–138). New York: Guilford Press.

McCrae, R. R., Kurtz, J. E., Yamagata, S., & Terracciano, A. (2011). Internal consistency, retest reliability, and their implications for personality scale validity. Personality and Social Psychology Review, 15, 28–50.

Noftle, E. E., & Fleeson, W. (2010). Age differences in Big Five behavior averages and variabilities across the adult life span: Moving beyond retrospective, global summary accounts of personality. Psychology and Aging, 25, 95–107.

Paunonen, S. V., & Jackson, D. N. (1985). Idiographic measurement strategies for personality and prediction: Some unredeemed promissory notes. Psychological Review, 92, 486–511.

Roberts, B. W., & DelVecchio, W. F. (2000). The rank-order consistency of personality from childhood to old age: A quantitative review of longitudinal studies. Psychological Bulletin, 126, 3–25.


Roberts, B. W., Walton, K. E., & Viechtbauer, W. (2006). Patterns of mean-level change in personality traits across the life course: A meta-analysis of longitudinal studies. Psychological Bulletin, 132, 1–25.

Roberts, B. W., Wood, D., & Caspi, A. (2008). The development of personality traits in adulthood. In O. P. John, R. W. Robins, & L. A. Pervin (Eds.), Handbook of personality (pp. 375–398). New York: Guilford Press.

Roberts, B., Wood, D., & Smith, J. (2005). Evaluating five factor theory and social investment perspectives on personality trait development. Journal of Research in Personality, 39, 166–184.

Robinson, M. D., & Clore, G. L. (2002). Belief and feeling: Evidence for an accessibility model of emotional self-report. Psychological Bulletin, 128, 934–960.

Schwarz, N., & Clore, G. L. (1983). Mood, misattribution, and judgments of well-being: Informative and directive functions of affective states. Journal of Personality and Social Psychology, 45, 513–523.

Soto, C. J., John, O. P., Gosling, S. D., & Potter, J. (2011). Age differences in personality traits from 10 to 65: Big Five domains and facets in a large cross-sectional sample. Journal of Personality and Social Psychology, 100, 330–348.

Srivastava, S., John, O. P., Gosling, S. D., & Potter, J. (2003). Development of personality in early and middle adulthood: Set like plaster or persistent change? Journal of Personality and Social Psychology, 84, 1041–1053.

Walton, K., & Roberts, B. W. (2004). On the relationship between substance use and personality traits: Abstainers are not maladjusted. Journal of Research in Personality, 38, 515–535.

Watson, D. (2004). Stability versus change, dependability versus error: Issues in the assessment of personality over time. Journal of Research in Personality, 38, 319–350.

Wood, D., Gosling, S., & Potter, J. (2007). Normality evaluations and their relation to personality traits and well-being. Journal of Personality and Social Psychology, 93, 861–879.

Wood, D., Harms, P., & Vazire, S. (2010). Perceiver effects as projective tests: What your perceptions of others say about you. Journal of Personality and Social Psychology, 99, 174–190.

Wood, D., & Hensler, M. (2011). How a functionalist understanding of behavior can explain trait variation and covariation without the use of latent factors. Retrieved from http://hdl.handle.net/10339/36461

Wood, D., Nye, C., & Saucier, G. (2010). Identification and measurement of a more comprehensive set of person-descriptive trait markers from the English lexicon. Journal of Research in Personality, 44, 257–272.

Vaidya, J., Gray, E., Haig, J., Mroczek, D., & Watson, D. (2008). Differential stability and individual growth trajectories of Big Five and affective traits during young adulthood. Journal of Personality, 76, 267–304.

Vaidya, J., Gray, E., Haig, J., & Watson, D. (2002). On the temporal stability of personality: Evidence for differential stability and the role of life experiences. Journal of Personality and Social Psychology, 83, 1469–1484.
