CHAPTER 6
In everyday language we say that something is valid if it is sound, meaningful, or well grounded on principles or evidence. For example, we speak of a valid theory, a valid argument, or a valid reason. In legal terminology, lawyers say that something is valid if it is “executed with the proper formalities” (Black, 1979), such as a valid contract and a valid will. In each of these instances, people make judgments based on evidence of the meaningfulness or the veracity of something. Similarly, in the language of psychological assessment, validity is a term used in conjunction with the meaningfulness of a test score—what the test score truly means.
The Concept of Validity
Validity, as applied to a test, is a judgment or estimate of how well a test measures what it purports to measure in a particular context. More specifically, it is a judgment based on evidence about the appropriateness of inferences drawn from test scores. An inference is a logical result or deduction. Characterizations of the validity of tests and test scores are frequently phrased in terms such as “acceptable” or “weak.” These terms reflect a judgment about how adequately the test measures what it purports to measure.
Inherent in a judgment of an instrument’s validity is a judgment of how useful the instrument is for a particular purpose with a particular population of people. As a shorthand, assessors may refer to a particular test as a “valid test.” However, what is really meant is that the test has been shown to be valid for a particular use with a particular population of testtakers at a particular time. No test or measurement technique is “universally valid” for all time, for all uses, with all types of testtaker populations. Rather, tests may be shown to be valid within what we would characterize as reasonable boundaries of a contemplated usage. If those boundaries are exceeded, the validity of the test may be called into question. Further, to the extent that the validity of a test may diminish as the culture or the times change, the validity of a test may have to be re-established with the same as well as other testtaker populations.
JUST THINK . . .
Why is the phrase valid test sometimes misleading?
Validation is the process of gathering and evaluating evidence about validity. Both the test developer and the test user may play a role in the validation of a test for a specific purpose. It is the test developer’s responsibility to supply validity evidence in the test manual. It may sometimes be appropriate for test users to conduct their own validation studies with their own groups of testtakers. Such local validation studies may yield insights regarding a particular population of testtakers as compared to the norming sample described in a test manual. Local validation studies are absolutely necessary when the test user plans to alter in some way the format, instructions, language, or content of the test. For example, a local validation study would be necessary if the test user sought to transform a nationally standardized test into Braille for administration to blind and visually impaired testtakers. Local validation studies would also be necessary if a test user sought to use a test with a population of testtakers that differed in some significant way from the population on which the test was standardized.
JUST THINK . . .
Local validation studies require professional time and know-how, and they may be costly. For these reasons, they might not be done even if they are desirable or necessary. What would you recommend to a test user who is in no position to conduct such a local validation study but who nonetheless is contemplating the use of a test that requires one?
One way measurement specialists have traditionally conceptualized validity is according to three categories:
1. Content validity. This is a measure of validity based on an evaluation of the subjects, topics, or content covered by the items in the test.
2. Criterion-related validity. This is a measure of validity obtained by evaluating the relationship of scores obtained on the test to scores on other tests or measures.
3. Construct validity. This is a measure of validity that is arrived at by executing a comprehensive analysis of
a. how scores on the test relate to other test scores and measures, and
b. how scores on the test can be understood within some theoretical framework for understanding the construct that the test was designed to measure.
In this classic conception of validity, referred to as the trinitarian view (Guion, 1980), it might be useful to visualize construct validity as being “umbrella validity” because every other variety of validity falls under it. Why construct validity is the overriding variety of validity will become clear as we discuss what makes a test valid and the methods and procedures used in validation. Indeed, there are many ways of approaching the process of test validation, and these different plans of attack are often referred to as strategies. We speak, for example, of content validation strategies, criterion-related validation strategies, and construct validation strategies.
Trinitarian approaches to validity assessment are not mutually exclusive. That is, each of the three conceptions of validity provides evidence that, with other evidence, contributes to a judgment concerning the validity of a test. Stated another way, all three types of validity evidence contribute to a unified picture of a test’s validity. A test user may not need to know about all three. Depending on the use to which a test is being put, one type of validity evidence may be more relevant than another.
The trinitarian model of validity is not without its critics (Landy, 1986). Messick (1995), for example, condemned this approach as fragmented and incomplete. He called for a unitary view of validity, one that takes into account everything from the implications of test scores in terms of societal values to the consequences of test use. However, even in the so-called unitary view, different elements of validity may come to the fore for scrutiny, and so an understanding of those elements in isolation is necessary.
In this chapter we discuss content validity, criterion-related validity, and construct validity, three now-classic approaches to judging whether a test measures what it purports to measure. Let’s note at the outset that, although the trinitarian model focuses on three types of validity, you are likely to come across other varieties of validity in your readings. For example, you are likely to come across the term ecological validity. You may recall from Chapter 1 that the term ecological momentary assessment (EMA) refers to the in-the-moment and in-the-place evaluation of targeted variables (such as behaviors, cognitions, and emotions) in a natural, naturalistic, or real-life context. In a somewhat similar vein, the term ecological validity refers to a judgment regarding how well a test measures what it purports to measure at the time and place that the variable being measured (typically a behavior, cognition, or emotion) is actually emitted. In essence, the greater the ecological validity of a test or other measurement procedure, the greater the generalizability of the measurement results to particular real-life circumstances.
Part of the appeal of EMA is that it does not have the limitations of retrospective self-report. Studies of the ecological validity of many tests or other assessment procedures are conducted in a natural (or naturalistic) environment, which is identical or similar to the environment in which a targeted behavior or other variable might naturally occur (see, for example, Courvoisier et al., 2012; Lewinski et al., 2014; Lo et al., 2015). However, in some cases, owing to the nature of the particular variable under study, such research may be retrospective in nature (see, for example, the 2014 Weems et al. study of memory for traumatic events).
Other validity-related terms that you will come across in the psychology literature are predictive validity and concurrent validity. We discuss these terms later in this chapter in the context of criterion-related validity. Yet another term you may come across is face validity (see Figure 6–1). In fact, you will come across that term right now . . .
Figure 6–1. Face Validity and Comedian Rodney Dangerfield. Rodney Dangerfield (1921–2004) was famous for complaining, “I don’t get no respect.” Somewhat analogously, the concept of face validity has been described as the “Rodney Dangerfield of psychometric variables” because it has “received little attention—and even less respect—from researchers examining the construct validity of psychological tests and measures” (Bornstein et al., 1994, p. 363). By the way, the tombstone of this beloved stand-up comic and film actor reads: “Rodney Dangerfield . . . There goes the neighborhood.” © Arthur Schatz/The Life Images Collection/Getty Images
Face validity relates more to what a test appears to measure to the person being tested than to what the test actually measures. Face validity is a judgment concerning how relevant the test items appear to be. Stated another way, if a test definitely appears to measure what it purports to measure “on the face of it,” then it could be said to be high in face validity. A paper-and-pencil personality test labeled The Introversion/Extraversion Test, with items that ask respondents whether they have acted in an introverted or an extraverted way in particular situations, may be perceived by respondents as a highly face-valid test. On the other hand, a personality test in which respondents are asked to report what they see in inkblots may be perceived as a test with low face validity. Many respondents would be left wondering how what they said they saw in the inkblots really had anything at all to do with personality.
In contrast to judgments about the reliability of a test and judgments about the content, construct, or criterion-related validity of a test, judgments about face validity are frequently thought of from the perspective of the testtaker, not the test user. A test’s lack of face validity could contribute to a lack of confidence in the perceived effectiveness of the test, with a consequential decrease in the testtaker’s cooperation or motivation to do his or her best. In a corporate environment, lack of face validity may lead to unwillingness of administrators or managers to “buy in” to the use of a particular test (see this chapter’s Meet an Assessment Professional). In a similar vein, parents may object to having their children tested with instruments that lack ostensible validity. Such concern might stem from a belief that the use of such tests will result in invalid conclusions.
MEET AN ASSESSMENT PROFESSIONAL
Meet Dr. Adam Shoemaker
In the “real world,” tests require buy-in from test administrators and candidates. While the reliability and validity of the test are always of primary importance, the test process can be short-circuited by administrators who don’t know how to use the test or who don’t have a good understanding of test theory. So at least half the battle of implementing a new testing tool is to make sure administrators know how to use it, accept the way that it works, and feel comfortable that it is tapping the skills and abilities necessary for the candidate to do the job.
Here’s an example: Early in my company’s history of using online assessments, we piloted a test that had acceptable reliability and criterion validity. We saw some strongly significant correlations between scores on the test and objective performance numbers, suggesting that this test did a good job of distinguishing between high and low performers on the job. The test proved to be unbiased and showed no demonstrable adverse impact against minority groups. However, very few test administrators felt comfortable using the assessment because most people felt that the skills that it tapped were not closely related to the skills needed for the job. Legally, ethically, and statistically, we were on firm ground, but we could never fully achieve “buy-in” from the people who had to administer the test.
On the other hand, we also piloted a test that showed very little criterion validity at all. There were no significant correlations between scores on the test and performance outcomes; the test was unable to distinguish between a high and a low performer. Still . . . the test administrators loved this test because it “looked” so much like the job. That is, it had high face validity and tapped skills that seemed to be precisely the kinds of skills that were needed on the job. From a legal, ethical, and statistical perspective, we knew we could not use this test to select employees, but we continued to use it to provide a “realistic job preview” to candidates. That way, the test continued to work for us in really showing candidates that this was the kind of thing they would be doing all day at work. More than a few times, candidates voluntarily withdrew from the process because they had a better understanding of what the job involved long before they even sat down at a desk.
Adam Shoemaker, Ph.D., Human Resources Consultant for Talent Acquisition, Tampa, Florida © Adam Shoemaker
The moral of this story is that as scientists, we have to remember that reliability and validity are super important in the development and implementation of a test . . . but as human beings, we have to remember that the test we end up using must also be easy to use and appear face valid for both the candidate and the administrator.
Read more of what Dr. Shoemaker had to say—his complete essay—through the Instructor Resources within Connect.
Used with permission of Adam Shoemaker.
JUST THINK . . .
What is the value of face validity from the perspective of the test user?
In reality, a test that lacks face validity may still be relevant and useful. However, if the test is not perceived as relevant and useful by testtakers, parents, legislators, and others, then negative consequences may result. These consequences may range from poor testtaker attitude to lawsuits filed by disgruntled parties against a test user and test publisher. Ultimately, face validity may be more a matter of public relations than psychometric soundness. Still, it is important nonetheless, and (much like Rodney Dangerfield) deserving of respect.
Content validity describes a judgment of how adequately a test samples behavior representative of the universe of behavior that the test was designed to sample. For example, the universe of behavior referred to as assertive is very wide-ranging. A content-valid, paper-and-pencil test of assertiveness would be one that is adequately representative of this wide range. We might expect that such a test would contain items sampling from hypothetical situations at home (such as whether the respondent has difficulty in making her or his views known to fellow family members), on the job (such as whether the respondent has difficulty in asking subordinates to do what is required of them), and in social situations (such as whether the respondent would send back a steak not done to order in a fancy restaurant). Ideally, test developers have a clear (as opposed to “fuzzy”) vision of the construct being measured, and the clarity of this vision can be reflected in the content validity of the test (Haynes et al., 1995). In the interest of ensuring content validity, test developers strive to include key components of the construct targeted for measurement, and exclude content irrelevant to the construct targeted for measurement.
With respect to educational achievement tests, it is customary to consider a test a content-valid measure when the proportion of material covered by the test approximates the proportion of material covered in the course. A cumulative final exam in introductory statistics would be considered content-valid if the proportion and type of introductory statistics problems on the test approximate the proportion and type of introductory statistics problems presented in the course.
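As a rough illustration of this proportionality idea, the sketch below compares each topic's share of exam items with its share of course coverage. The topic names, the counts, and the 0.05 tolerance are all invented for illustration; the text sets no numeric threshold, and a real content-validity judgment rests on expert review of the content, not on this arithmetic alone.

```python
def proportions(counts):
    """Convert raw counts per topic into proportions that sum to 1."""
    total = sum(counts.values())
    return {topic: n / total for topic, n in counts.items()}


def coverage_gaps(course_counts, test_counts, tolerance=0.05):
    """Return topics whose share of test items differs from their share
    of course coverage by more than `tolerance` (an arbitrary cutoff).

    A positive value means the topic is over-represented on the test;
    a negative value means it is under-represented.
    """
    course = proportions(course_counts)
    test = proportions(test_counts)
    gaps = {}
    for topic in sorted(set(course) | set(test)):
        diff = test.get(topic, 0.0) - course.get(topic, 0.0)
        if abs(diff) > tolerance:
            gaps[topic] = round(diff, 3)
    return gaps


# Hypothetical data: lecture hours per topic and exam items per topic.
course_hours = {"descriptive": 10, "probability": 10, "inference": 20}
exam_items = {"descriptive": 12, "probability": 6, "inference": 30}

gaps = coverage_gaps(course_hours, exam_items)
print(gaps)
```

With these made-up numbers, probability is under-sampled and inference is over-sampled relative to the course, which is the kind of mismatch a test developer would want to review against the blueprint.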
The early stages of a test being developed for use in the classroom (be it one classroom or those throughout the state or the nation) typically entail research exploring the universe of possible instructional objectives for the course. Included among the many possible sources of information on such objectives are course syllabi, course textbooks, teachers of the course, specialists who develop curricula, and professors and supervisors who train teachers in the particular subject area. From the pooled information (along with the judgment of the test developer), there emerges a test blueprint for the “structure” of the evaluation, that is, a plan regarding the types of information to be covered by the items, the number of items tapping each area of coverage, the organization of the items in the test, and so forth (see Figure 6–2). In many instances the test blueprint represents the culmination of efforts to adequately sample the universe of content areas that conceivably could be sampled in such a test.
Figure 6–2. Building a Test from a Test Blueprint. An architect’s blueprint usually takes the form of a technical drawing or diagram of a structure, sometimes written in white lines on a blue background. The blueprint may be thought of as a plan of a structure, typically detailed enough so that the structure could actually be constructed from it. Somewhat comparable to the architect’s blueprint is the test blueprint of a test developer. Seldom, if ever, on a blue background and written in white, it is nonetheless a detailed plan of the content, organization, and quantity of the items that a test will contain, sometimes complete with “weightings” of the content to be covered (He, 2011; Spray & Huang, 2000; Sykes & Hou, 2003). A test administered on a regular basis may require “item-pool management” to manage the creation of new items and the output of old items in a manner that is consistent with the test’s blueprint (Ariel et al., 2006; van der Linden et al., 2000). © John Rowley/Getty Images RF
JUST THINK . . .
A test developer is working on a brief screening instrument designed to predict student success in a psychological testing and assessment course. You are the consultant called upon to blueprint the content areas covered. Your recommendations?
For an employment test to be content-valid, its content must be a representative sample of the job-related skills required for employment. Behavioral observation is one technique frequently used in blueprinting the content areas to be covered in certain types of employment tests. The test developer will observe successful veterans on that job, note the behaviors necessary for success on the job, and design the test to include a representative sample of those behaviors. Those same workers (as well as their supervisors and others) may subsequently be called on to act as experts or judges in rating the degree to which the content of the test is a representative sample of the required job-related skills. At that point, the test developer will want to know about the extent to which the experts or judges agree. A description of one such method for quantifying the degree of agreement between such raters can be found “online only” through the Instructor Resources within Connect (refer to OOBAL-6-B2).
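The online-only method referred to above is not reproduced here. Purely as an illustration of how rater agreement can be put into numbers, the sketch below computes Lawshe's content validity ratio (CVR), a well-known index in which each rater judges whether an item is essential. Whether this is the method described in the online material is an assumption, and the panel size and vote counts are hypothetical.

```python
def content_validity_ratio(n_essential, n_raters):
    """Lawshe's CVR for one item: (n_e - N/2) / (N/2).

    n_essential: number of raters who judged the item essential (n_e)
    n_raters: total number of raters on the panel (N)
    The result ranges from -1 (no rater says essential) to +1 (all do);
    0 means exactly half of the panel judged the item essential.
    """
    if n_raters <= 0 or not 0 <= n_essential <= n_raters:
        raise ValueError("need n_raters > 0 and 0 <= n_essential <= n_raters")
    half = n_raters / 2
    return (n_essential - half) / half


# Hypothetical panel of 10 raters judging three test items.
essential_votes = {"item_1": 10, "item_2": 5, "item_3": 2}
cvr = {
    item: content_validity_ratio(votes, 10)
    for item, votes in essential_votes.items()
}
print(cvr)
```

Items with a CVR near +1 have strong agreement that the content belongs on the test; items at or below 0 would be candidates for revision or removal. Any cutoff for "enough" agreement depends on panel size and is not specified here.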
Culture and the relativity of content validity
Tests are often thought of as either valid or not valid. A history test, for example, either does or does not accurately measure one’s knowledge of historical fact. However, it is also true that what constitutes historical fact depends to some extent on who is writing the history. Consider, for example, a momentous event in the history of the world, one that served as a catalyst for World War I. Archduke Franz Ferdinand was assassinated on June 28, 1914, by a Serb named Gavrilo Princip (Figure 6–3). Now think about how you would answer the following multiple-choice item on a history test:
Figure 6–3. Cultural Relativity, History, and Test Validity. Austro-Hungarian Archduke Franz Ferdinand and his wife, Sophia, are pictured (left) as they left Sarajevo’s City Hall on June 28, 1914. Moments later, Ferdinand would be assassinated by Gavrilo Princip, shown in custody at right. The killing served as a catalyst for World War I and is discussed and analyzed in history textbooks in every language around the world. Yet descriptions of the assassin Princip in those textbooks (and ability test items based on those descriptions) vary as a function of culture. © Ingram Publishing RF
Gavrilo Princip was
a. a poet
b. a hero
c. a terrorist
d. a nationalist
e. all of the above
For various textbooks in the Bosnian region of the world, choice “e”—that’s right, “all of the above”—is the “correct” answer. Hedges (1997) observed that textbooks in areas of Bosnia and Herzegovina that were controlled by different ethnic groups imparted widely varying characterizations of the assassin. In the Serb-controlled region of the country, history textbooks—and presumably the tests constructed to measure students’ learning—regarded Princip as a “hero and poet.” By contrast, Croatian students might read that Princip was an assassin trained to commit a terrorist act. Muslims in the region were taught that Princip was a nationalist whose deed sparked anti-Serbian rioting.
JUST THINK . . .
The passage of time sometimes serves to place historical figures in a different light. How might the textbook descriptions of Gavrilo Princip have changed in these regions?
A history test considered valid in one classroom, at one time, and in one place will not necessarily be considered so in another classroom, at another time, and in another place. Consider a test containing the true-false item, “Colonel Claus von Stauffenberg is a hero.” Such an item is useful in illustrating the cultural relativity affecting item scoring. In 1944, von Stauffenberg, a German officer, was an active participant in a bomb plot to assassinate Germany’s leader, Adolf Hitler. When the plot (popularized in the film, Operation Valkyrie) failed, von Stauffenberg was executed and promptly vilified in Germany as a despicable traitor. Today, the light of history shines favorably on von Stauffenberg, and he is perceived as a hero in Germany. A German postage stamp with his face on it was issued to honor von Stauffenberg’s 100th birthday.
Politics is another factor that may well play a part in perceptions and judgments concerning the validity of tests and test items. In many countries throughout the world, a response that is keyed incorrect to a particular test item can lead to consequences far more dire than a deduction in points towards the total test score. Sometimes, even constructing a test with a reference to a taboo topic can have dire consequences for the test developer. For example, one Palestinian professor who included items pertaining to governmental corruption on an examination was tortured by authorities as a result (“Brother Against Brother,” 1997). Such scenarios bring new meaning to the term politically correct as it applies to tests, test items, and testtaker responses.
JUST THINK . . .
Commercial test developers who publish widely used history tests must maintain the content validity of their tests. What challenges do they face in doing so?
Criterion-related validity is a judgment of how adequately a test score can be used to infer an individual’s most probable standing on some measure of interest—the measure of interest being the criterion. Two types of validity evidence are subsumed under the heading criterion-related validity. Concurrent validity is an index of the degree to which a test score is related to some criterion measure obtained at the same time (concurrently). Predictive validity is an index of the degree to which a test score predicts some criterion measure. Before we discuss each of these types of validity evidence in detail, it seems appropriate to raise (and answer) an important question.
What Is a Criterion?
We were first introduced to the concept of a criterion in Chapter 4, where, in the context of defining criterion-referenced assessment, we defined a criterion broadly as a standard on which a judgment or decision may be based. Here, in the context of our discussion of criterion-related validity, we will define a criterion just a bit more narrowly as the standard against which a test or a test score is evaluated. So, for example, if a test purports to measure the trait of athleticism, we might expect to employ “membership in a health club” or any generally accepted measure of physical fitness as a criterion in evaluating whether the athleticism test truly measures athleticism. Operationally, a criterion can be most anything: pilot performance in flying a Boeing 767, grade on examination in Advanced Hairweaving, number of days spent in psychiatric hospitalization; the list is endless. There are no hard-and-fast rules for what constitutes a criterion. It can be a test score, a specific behavior or group of behaviors, an amount of time, a rating, a psychiatric diagnosis, a training cost, an index of absenteeism, an index of alcohol intoxication, and so on. Whatever the criterion, ideally it is relevant, valid, and uncontaminated. Let’s explain.
Characteristics of a criterion
An adequate criterion is relevant. By this we mean that it is pertinent or applicable to the matter at hand. We would expect, for example, that a test purporting to advise testtakers whether they share the same interests as successful actors would have been validated using the interests of successful actors as a criterion.
An adequate criterion measure must also be valid for the purpose for which it is being used. If one test (X) is being used as the criterion to validate a second test (Y), then evidence should exist that test X is valid. If the criterion used is a rating made by a judge or a panel, then evidence should exist that the rating is valid. Suppose, for example, that a test purporting to measure depression is said to have been validated using as a criterion the diagnoses made by a blue-ribbon panel of psychodiagnosticians. A test user might wish to probe further regarding variables such as the credentials of the “blue-ribbon panel” (such as their educational background, training, and experience) and the actual procedures used to validate a diagnosis of depression. Answers to such questions would help address the issue of whether the criterion (in this case, the diagnoses made by panel members) was indeed valid.
Ideally, a criterion is also uncontaminated. Criterion contamination is the term applied to a criterion measure that has been based, at least in part, on predictor measures. As an example, consider a hypothetical “Inmate Violence Potential Test” (IVPT) designed to predict a prisoner’s potential for violence in the cell block. In part, this evaluation entails ratings from fellow inmates, guards, and other staff in order to come up with a number that represents each inmate’s violence potential. After all of the inmates in the study have been given scores on this test, the study authors then attempt to validate the test by asking guards to rate each inmate on their violence potential. Because the guards’ opinions were used to formulate the inmate’s test score in the first place (the predictor variable), the guards’ opinions cannot be used as a criterion against which to judge the soundness of the test. If the guards’ opinions were used both as a predictor and as a criterion, then we would say that criterion contamination had occurred.
Here is another example of criterion contamination. Suppose that a team of researchers from a company called Ventura International Psychiatric Research (VIPR) just completed a study of how accurately a test called the MMPI-2-RF predicted psychiatric diagnosis in the psychiatric population of the Minnesota state hospital system. As we will see in Chapter 12, the MMPI-2-RF is, in fact, a widely used test. In this study, the predictor is the MMPI-2-RF, and the criterion is the psychiatric diagnosis that exists in the patient’s record. Further, let’s suppose that while all the data are being analyzed at VIPR headquarters, someone informs these researchers that the diagnosis for every patient in the Minnesota state hospital system was determined, at least in part, by an MMPI-2-RF test score. Should they still proceed with their analysis? The answer is no. Because the predictor measure has contaminated the criterion measure, it would be of little value to find, in essence, that the predictor can indeed predict itself.
When criterion contamination does occur, the results of the validation study cannot be taken seriously. There are no methods or statistics to gauge the extent to which criterion contamination has taken place, and there are no methods or statistics to correct for such contamination.
Now, let’s take a closer look at concurrent validity and predictive validity.
If test scores are obtained at about the same time as the criterion measures are obtained, measures of the relationship between the test scores and the criterion provide evidence of concurrent validity. Statements of concurrent validity indicate the extent to which test scores may be used to estimate an individual’s present standing on a criterion. If, for example, scores (or classifications) made on the basis of a psychodiagnostic test were to be validated against a criterion of already diagnosed psychiatric patients, then the process would be one of concurrent validation. In general, once the validity of the inference from the test scores is established, the test may provide a faster, less expensive way to offer a diagnosis or a classification decision. A test with satisfactorily demonstrated concurrent validity may therefore be appealing to prospective users because it holds out the potential of savings of money and professional time.
Sometimes the concurrent validity of a particular test (let’s call it Test A) is explored with respect to another test (we’ll call it Test B). In such studies, prior research has satisfactorily demonstrated the validity of Test B, so the question becomes: “How well does Test A compare with Test B?” Here, Test B is used as the validating criterion. In some studies, Test A is either a brand-new test or a test being used for some new purpose, perhaps with a new population.
Here is a real-life example of a concurrent validity study in which a group of researchers explored whether a test validated for use with adults could be used with adolescents. The Beck Depression Inventory (BDI; Beck et al., 1961, 1979; Beck & Steer, 1993) and its revision, the Beck Depression Inventory-II (BDI-II; Beck et al., 1996) are self-report measures used to identify symptoms of depression and quantify their severity. Although the BDI had been widely used with adults, questions were raised regarding its appropriateness for use with adolescents. Ambrosini et al. (1991) conducted a concurrent validity study to explore the utility of the BDI with adolescents. They also sought to determine if the test could successfully differentiate patients with depression from those without depression in a population of adolescent outpatients. Diagnoses generated from the concurrent administration of an instrument previously validated for use with adolescents were used as the criterion validators. The findings suggested that the BDI is valid for use with adolescents.
JUST THINK . . .
What else might these researchers have done to explore the utility of the BDI with adolescents?
We now turn our attention to another form of criterion validity, one in which the criterion measure is obtained not concurrently but at some future time.
Test scores may be obtained at one time and the criterion measures obtained at a future time, usually after some intervening event has taken place. The intervening event may take varied forms, such as training, experience, therapy, medication, or simply the passage of time. Measures of the relationship between the test scores and a criterion measure obtained at a future time provide an indication of the predictive validity of the test; that is, how accurately scores on the test predict some criterion measure. Measures of the relationship between college admissions tests and freshman grade point averages, for example, provide evidence of the predictive validity of the admissions tests.
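One common way to express such a relationship as a number is a correlation between test scores and the later criterion, often called a validity coefficient. The sketch below uses the Pearson correlation, which is an assumption on our part since the passage does not name a statistic, and the admissions scores and grade point averages are invented for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between paired lists x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and hold at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dev_x = [a - mean_x for a in x]
    dev_y = [b - mean_y for b in y]
    # Sum of cross-products and sums of squares of the deviations.
    cross = sum(a * b for a, b in zip(dev_x, dev_y))
    ss_x = sum(a * a for a in dev_x)
    ss_y = sum(b * b for b in dev_y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cross / sqrt(ss_x * ss_y)


# Hypothetical data: admissions test scores collected at application time,
# and the same students' freshman GPA collected a year later.
admission_scores = [480, 520, 560, 600, 640, 700]
freshman_gpa = [2.4, 2.9, 2.7, 3.2, 3.1, 3.8]

validity_coefficient = pearson_r(admission_scores, freshman_gpa)
print(round(validity_coefficient, 2))
```

The same calculation could serve a concurrent validation study; the only difference is that the criterion scores would be collected at about the same time as the test scores instead of later.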
In settings where tests might be employed (such as a personnel agency, a college admissions office, or a warden’s office), a test’s high predictive validity can be a useful aid to decision-makers who must select successful students, productive workers, or good parole risks. Whether a test result is valuable in decision making depends on how well the test results improve selection decisions over decisions made without knowledge of test results. In an industrial setting where volume turnout is important, if the use of a personnel selection test can enhance productivity to even a small degree, then that enhancement will pay off year after year and may translate into millions of dollars of increased revenue. And in a clinical context, no price could be placed on a test that could save more lives from suicide by providing predictive accuracy over and above existing tests with respect to such acts. Unfortunately, the difficulties inherent in developing such tests are numerous and multifaceted (Mulvey & Lidz, 1984; Murphy, 1984; Petrie & Chamberlain, 1985).
When evaluating the predictive validity of a test, researchers must take into consideration the base rate of the occurrence of the variable in question, both as that variable exists in the general population and as it exists in the sample being studied. Generally, a base rate is the extent to which a particular trait, behavior, characteristic, or attribute exists in the population (expressed as a proportion). In psychometric parlance, a hit rate may be defined as the proportion of people a test accurately identifies as possessing or exhibiting a particular trait, behavior, characteristic, or attribute. For example, hit rate could refer to the proportion of people accurately predicted to be able to perform work at the graduate school level or to the proportion of neurological patients accurately identified as having a brain tumor.
In like fashion, a miss rate may be defined as the proportion of people the test fails to identify as having, or not having, a particular characteristic or attribute. Here, a miss amounts to an inaccurate prediction. The category of misses may be further subdivided. A false positive is a miss wherein the test predicted that the testtaker did possess the particular characteristic or attribute being measured when in fact the testtaker did not. A false negative is a miss wherein the test predicted that the testtaker did not possess the particular characteristic or attribute being measured when the testtaker actually did.
To evaluate the predictive validity of a test, a test targeting a particular attribute may be administered to a sample of research subjects in which approximately half of the subjects possess or exhibit the targeted attribute and the other half do not. Evaluating the predictive validity of a test is essentially a matter of evaluating the extent to which use of the test results in an acceptable hit rate.
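The terms just defined (base rate, hit rate, miss rate, false positive, and false negative) can be tallied directly from paired predictions and outcomes. The following is only a minimal sketch in Python: the function name, the dictionary keys, and the eight-person data set are hypothetical, and the sketch assumes that a "hit" is any accurate classification (attribute present or absent), with every figure expressed as a proportion of the whole sample.

```python
def classification_rates(predicted, actual):
    """Summarize how test predictions line up with criterion outcomes.

    predicted, actual: equal-length sequences of booleans
    (True = the testtaker has the attribute).
    Every value returned is a proportion of the whole sample.
    """
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(actual)
    pairs = list(zip(predicted, actual))
    # A false positive: test said "has it", criterion said "does not".
    false_pos = sum(1 for p, a in pairs if p and not a)
    # A false negative: test said "does not", criterion said "has it".
    false_neg = sum(1 for p, a in pairs if not p and a)
    hits = n - false_pos - false_neg
    return {
        "base_rate": sum(1 for a in actual if a) / n,
        "hit_rate": hits / n,
        "miss_rate": (false_pos + false_neg) / n,
        "false_positive": false_pos / n,
        "false_negative": false_neg / n,
    }


# Hypothetical sample: eight subjects, half of whom have the attribute.
predicted = [True, True, True, False, False, False, True, False]
actual = [True, True, False, False, False, True, True, False]
rates = classification_rates(predicted, actual)
print(rates)
```

In this made-up sample the test classifies six of eight subjects accurately, and the two misses split evenly between one false positive and one false negative.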
Judgments of criterion-related validity, whether concurrent or predictive, are based on two types of statistical evidence: the validity coefficient and expectancy data.
The validity coefficient
The validity coefficient is a correlation coefficient that provides a measure of the relationship between test scores and scores on the criterion measure. The correlation coefficient computed from a score (or classification) on a psychodiagnostic test and the criterion score (or classification) assigned by psychodiagnosticians is one example of a validity coefficient. Typically, the Pearson correlation coefficient is used to determine the validity between the two measures. However, depending on variables such as the type of data, the sample size, and the shape of the distribution, other correlation coefficients could be used. For example, in correlating self-rankings of performance on some job with rankings made by job supervisors, the formula for the Spearman rho rank-order correlation would be employed.
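As an illustration of the two coefficients just mentioned, the sketch below computes a Pearson r and a Spearman rho in plain Python. The function names and the job-performance rankings are hypothetical, and Spearman rho is obtained here by the common route of applying the Pearson formula to ranks (tied values share an average rank).

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Rank values from 1 (smallest); tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman_rho(x, y):
    """Spearman rank-order correlation: Pearson r computed on ranks."""
    return pearson_r(ranks(x), ranks(y))


# Hypothetical rankings of five employees' job performance.
self_ranking = [1, 2, 3, 4, 5]
supervisor_ranking = [2, 1, 4, 3, 5]
rho = spearman_rho(self_ranking, supervisor_ranking)
print(round(rho, 2))
```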
Like the reliability coefficient and other correlational measures, the validity coefficient is affected by restriction or inflation of range. And as in other correlational studies, a key issue is whether the range of scores employed is appropriate to the objective of the correlational analysis. In situations where, for example, attrition in the number of subjects has occurred over the course of the study, the validity coefficient may be adversely affected.
The problem of restricted range can also occur through a self-selection process in the sample employed for the validation study. Thus, for example, if the test purports to measure something as technical or as dangerous as oil-barge firefighting skills, it may well be that the only people who reply to an ad for the position of oil-barge firefighter are those who are actually highly qualified for the position. Accordingly, the range of the distribution of scores on this test of oil-barge firefighting skills would be restricted. For less technical or dangerous positions, a self-selection factor might be operative if the test developer selects a group of newly hired employees to test (with the expectation that criterion measures will be available for this group at some subsequent date). However, because the newly hired employees have probably already passed some formal or informal evaluation in the process of being hired, there is a good chance that ability to do the job will be higher among this group than among a random sample of ordinary job applicants. Consequently, scores on the criterion measure that is later administered will tend to be higher than scores on the criterion measure obtained from a random sample of ordinary job applicants. Stated another way, the scores will be restricted in range.
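The effect of restricted range on a validity coefficient can be seen in a small numerical sketch. The ten applicants, their scores, and the cutoff of 7 below are all invented for illustration; the only point is that the same test-criterion relationship yields a smaller correlation once the sample is limited to high scorers.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical test and criterion scores for ten applicants.
test_scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
criterion = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]

# Validity coefficient in the full, unrestricted applicant pool.
r_full = pearson_r(test_scores, criterion)

# Validity coefficient when only the top scorers (test score >= 7) are studied.
hired = [(t, c) for t, c in zip(test_scores, criterion) if t >= 7]
r_restricted = pearson_r([t for t, _ in hired], [c for _, c in hired])

print(round(r_full, 2), round(r_restricted, 2))  # prints 0.94 0.6
```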
Whereas it is the responsibility of the test developer to report validation data in the test manual, it is the responsibility of test users to read carefully the description of the validation study and then to evaluate the suitability of the test for their specific purposes. What were the characteristics of the sample used in the validation study? How well do those characteristics match the people for whom an administration of the test is contemplated? For a specific test purpose, are some subtests of a test more appropriate than the entire test?
How high should a validity coefficient be for a user or a test developer to infer that the test is valid? There are no rules for determining the minimum acceptable size of a validity coefficient. In fact, Cronbach and Gleser (1965) cautioned against the establishment of such rules. They argued that validity coefficients need to be large enough to enable the test user to make accurate decisions within the unique context in which a test is being used. Essentially, the validity coefficient should be high enough to result in the identification and differentiation of testtakers with respect to target attribute(s), such as employees who are likely to be more productive, police officers who are less likely to misuse their weapons, and students who are more likely to be successful in a particular course of study.
Test users involved in predicting some criterion from test scores are often interested in the utility of multiple predictors. The value of including more than one predictor depends on a couple of factors. First, of course, each measure used as a predictor should have criterion-related predictive validity. Second, additional predictors should possess incremental validity, defined here as the degree to which an additional predictor explains something about the criterion measure that is not explained by predictors already in use.
Incremental validity may be used when predicting something like academic success in college. Grade point average (GPA) at the end of the first year may be used as a measure of academic success. A study of potential predictors of GPA may reveal that time spent in the library and time spent studying are highly correlated with GPA. How much sleep a student’s roommate allows the student to have during exam periods correlates with GPA to a smaller extent. What is the most accurate but most efficient way to predict GPA? One approach, employing the principles of incremental validity, is to start with the best predictor: the predictor that is most highly correlated with GPA. This may be time spent studying. Then, using multiple regression techniques, one would examine the usefulness of the other predictors.
Even though time in the library is highly correlated with GPA, it may not possess incremental validity if it overlaps too much with the first predictor, time spent studying. Said another way, if time spent studying and time in the library are so highly correlated with each other that they reflect essentially the same thing, then only one of them needs to be included as a predictor. Including both predictors will provide little new information. By contrast, the variable of how much sleep a student’s roommate allows the student to have during exams may have good incremental validity. This is so because it reflects a different aspect of preparing for exams (resting) from the first predictor (studying). Incremental validity has been used to improve the prediction of job performance for Marine Corps mechanics (Carey, 1994) and the prediction of child abuse (Murphy-Berman, 1994). In both instances, predictor measures were included only if they demonstrated that they could explain something about the criterion measure that was not already known from the other predictors.
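The GPA example can be made concrete with a short calculation. One common way to express incremental validity is as the gain in the squared multiple correlation (R squared) when a second predictor is added to a regression that already contains the first. The sketch below uses the standard two-predictor formula for R squared written in terms of correlations; the function names and all of the correlation values are hypothetical and chosen only to mirror the pattern described above.

```python
def r_squared_two_predictors(r_y1, r_y2, r_12):
    """R squared when a criterion is regressed on two predictors.

    r_y1, r_y2: correlation of each predictor with the criterion.
    r_12: correlation between the two predictors.
    """
    if abs(r_12) >= 1:
        raise ValueError("predictors must not be perfectly correlated")
    return (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)


def incremental_validity(r_y1, r_y2, r_12):
    """Gain in R squared from adding predictor 2 to predictor 1 alone."""
    return r_squared_two_predictors(r_y1, r_y2, r_12) - r_y1 ** 2


# Hypothetical correlations. Predictor 1 is time spent studying (r = .60 with GPA).
# Library time: r = .55 with GPA, but r = .85 with time spent studying.
gain_library = incremental_validity(0.60, 0.55, 0.85)
# Sleep allowed by roommate: r = .30 with GPA, only r = .10 with studying.
gain_sleep = incremental_validity(0.60, 0.30, 0.10)

print(round(gain_library, 3), round(gain_sleep, 3))  # prints 0.006 0.058
```

Under these assumed values, library time adds almost nothing beyond study time despite its higher correlation with GPA, whereas the sleep variable adds roughly ten times as much explained variance.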
Construct validity is a judgment about the appropriateness of inferences drawn from test scores regarding individual standings on a variable called a construct. A construct is an informed, scientific idea developed or hypothesized to describe or explain behavior. Intelligence is a construct that may be invoked to describe why a student performs well in school. Anxiety is a construct that may be invoked to describe why a psychiatric patient paces the floor. Other examples of constructs are job satisfaction, personality, bigotry, clerical aptitude, depression, motivation, self-esteem, emotional adjustment, potential dangerousness, executive potential, creativity, and mechanical comprehension, to name but a few.
Constructs are unobservable, presupposed (underlying) traits that a test developer may invoke to describe test behavior or criterion performance. The researcher investigating a test’s construct validity must formulate hypotheses about the expected behavior of high scorers and low scorers on the test. These hypotheses give rise to a tentative theory about the nature of the construct the test was designed to measure. If the test is a valid measure of the construct, then high scorers and low scorers will behave as predicted by the theory. If high scorers and low scorers on the test do not behave as predicted, the investigator will need to reexamine the nature of the construct itself or hypotheses made about it. One possible reason for obtaining results contrary to those predicted by the theory is that the test simply does not measure the construct. An alternative explanation could lie in the theory that generated hypotheses about the construct. The theory may need to be reexamined.
In some instances, the reason for obtaining contrary findings can be traced to the statistical procedures used or to the way the procedures were executed. One procedure may have been more appropriate than another, given the particular assumptions. Thus, although confirming evidence contributes to a judgment that a test is a valid measure of a construct, evidence to the contrary can also be useful. Contrary evidence can provide a stimulus for the discovery of new facets of the construct as well as alternative methods of measurement.
Traditionally, construct validity has been viewed as the unifying concept for all validity evidence (American Educational Research Association et al., 1999). As we noted at the outset, all types of validity evidence, including evidence from the content- and criterion-related varieties of validity, come under the umbrella of construct validity. Let’s look at the types of evidence that might be gathered.
Evidence of Construct Validity
A number of procedures may be used to provide different kinds of evidence that a test has construct validity. The various techniques of construct validation may provide evidence, for example, that
· the test is homogeneous, measuring a single construct;
· test scores increase or decrease as a function of age, the passage of time, or an experimental manipulation as theoretically predicted;
· test scores obtained after some event or the mere passage of time (or, posttest scores) differ from pretest scores as theoretically predicted;
· test scores obtained by people from distinct groups vary as predicted by the theory;
· test scores correlate with scores on other tests in accordance with what would be predicted from a theory that covers the manifestation of the construct in question.
A brief discussion of each type of construct validity evidence and the procedures used to obtain it follows.
Evidence of homogeneity
When describing a test and its items, homogeneity refers to how uniform a test is in measuring a single concept. A test developer can increase test homogeneity in several ways. Consider, for example, a test of academic achievement that contains subtests in areas such as mathematics, spelling, and reading comprehension. The Pearson r could be used to correlate average subtest scores with the average total test score. Subtests that in the test developer’s judgment do not correlate very well with the test as a whole might have to be reconstructed (or eliminated) lest the test not measure the construct academic achievement. Correlations between subtest scores and total test score are generally reported in the test manual as evidence of homogeneity.
One way a test developer can improve the homogeneity of a test containing items that are scored dichotomously (such as a true-false test) is by eliminating items that do not show significant correlation coefficients with total test scores. If all test items show significant, positive correlations with total test scores and if high scorers on the test tend to pass each item more than low scorers do, then each item is probably measuring the same construct as the total test. Each item is contributing to test homogeneity.
The homogeneity of a test in which items are scored on a multipoint scale can also be improved. For example, some attitude and opinion questionnaires require respondents to indicate level of agreement with specific statements by responding, for example, strongly agree, agree, disagree, or strongly disagree. Each response is assigned a numerical score, and items that do not show significant Spearman rank-order correlation coefficients are eliminated. If all test items show significant, positive correlations with total test scores, then each item is most likely measuring the same construct that the test as a whole is measuring (and is thereby contributing to the test’s homogeneity). Coefficient alpha may also be used in estimating the homogeneity of a test composed of multiple-choice items (Novick & Lewis, 1967).
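For readers who want to see how coefficient alpha is computed, here is a minimal sketch using the usual formula: the number of items k, divided by k minus 1, times one minus the ratio of summed item variances to the variance of total scores. The function name, the three-item questionnaire, and the 1-to-4 scoring of the agreement categories are assumptions made for illustration.

```python
from statistics import variance


def coefficient_alpha(responses):
    """Coefficient alpha for a table of item scores.

    responses: one row per respondent, one numeric score per item.
    """
    n_items = len(responses[0])
    if n_items < 2 or len(responses) < 2:
        raise ValueError("need at least two items and two respondents")
    # Variance of each item across respondents.
    item_variances = [
        variance([row[i] for row in responses]) for i in range(n_items)
    ]
    # Variance of respondents' total scores.
    total_variance = variance([sum(row) for row in responses])
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical scores (4 = strongly agree ... 1 = strongly disagree)
# from four respondents on a three-item questionnaire.
responses = [
    [4, 4, 3],
    [3, 3, 3],
    [2, 2, 1],
    [1, 1, 1],
]
alpha = coefficient_alpha(responses)
print(round(alpha, 3))  # prints 0.975
```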
As a case study illustrating how a test’s homogeneity can be improved, consider the Marital Satisfaction Scale (MSS; Roach et al., 1981). Designed to assess various aspects of married people’s attitudes toward their marital relationship, the MSS contains an approximately equal number of items expressing positive and negative sentiments with respect to marriage. For example, My life would seem empty without my marriage and My marriage has “smothered” my personality. In one stage of the development of this test, subjects indicated how much they agreed or disagreed with the various sentiments in each of 73 items by marking a 5-point scale that ranged from strongly agree to strongly disagree. Based on the correlations between item scores and total score, the test developers elected to retain 48 items with correlation coefficients greater than .50, thus creating a more homogeneous instrument.
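The item-retention step described for the MSS can be sketched as follows. This is not the procedure Roach et al. actually ran; it is a simplified illustration in which each item is correlated with the total score and items are kept only if the correlation exceeds .50. The function names and the tiny response table are hypothetical, and items are scored 0 or 1 here purely for brevity (the same computation applies to 5-point item scores).

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def item_total_correlations(responses):
    """Correlate each item's scores with respondents' total scores."""
    totals = [sum(row) for row in responses]
    n_items = len(responses[0])
    return [
        pearson_r([row[i] for row in responses], totals) for i in range(n_items)
    ]


def retained_items(responses, cutoff=0.50):
    """Indexes of items whose item-total correlation exceeds the cutoff."""
    correlations = item_total_correlations(responses)
    return [i for i, r in enumerate(correlations) if r > cutoff]


# Hypothetical scored responses: six respondents (rows) by four items (columns).
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]
print([round(r, 2) for r in item_total_correlations(responses)])
print(retained_items(responses))
```

In this invented table the first two items correlate strongly with the total and are retained, the third falls short of the cutoff, and the fourth correlates negatively with the total, which in practice might also prompt a check of whether the item should have been reverse-scored.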
Item-analysis procedures have also been employed in the quest for test homogeneity. One item-analysis procedure focuses on the relationship between testtakers’ scores on individual items and their score on the entire test. Each item is analyzed with respect to how high scorers versus low scorers responded to it. If it is an academic test and if high scorers on the entire test for some reason tended to get that particular item wrong while low scorers on the test as a whole tended to get the item right, the item is obviously not a good one. The item should be eliminated in the interest of test homogeneity, among other considerations. If the test is one of marital satisfaction, and if individuals who score high on the test as a whole respond to a particular item in a way that would indicate that they are not satisfied whereas people who tend not to be satisfied respond to the item in a way that would indicate that they are satisfied, then again the item should probably be eliminated or at least reexamined for clarity.
JUST THINK . . .
Is it possible for a test to be too homogeneous in item content?
Although test homogeneity is desirable because it assures us that all the items on the test tend to be measuring the same thing, it is not the be-all and end-all of construct validity. Knowing that a test is homogeneous contributes no information about how the construct being measured relates to other constructs. It is therefore important to report evidence of a test’s homogeneity along with other evidence of construct validity.
Evidence of changes with age
Some constructs are expected to change over time. Reading rate, for example, tends to increase dramatically year by year from age 6 to the early teens. If a test score purports to be a measure of a construct that could be expected to change over time, then the test score, too, should show the same progressive changes with age to be considered a valid measure of the construct. For example, if children in grades 6, 7, 8, and 9 took a test of eighth-grade vocabulary, then we would expect that the total number of items scored as correct from all the test protocols would increase as a function of the higher grade level of the testtakers.
Some constructs lend themselves more readily than others to predictions of change over time. Thus, although we may be able to predict that a gifted child’s scores on a test of reading skills will increase over the course of the testtaker’s years of elementary and secondary education, we may not be able to predict with such confidence how a newlywed couple will score through the years on a test of marital satisfaction. This fact does not relegate a construct such as marital satisfaction to a lower stature than reading ability. Rather, it simply means that measures of marital satisfaction may be less stable over time or more vulnerable to situational events (such as in-laws coming to visit and refusing to leave for three months) than is reading ability. Evidence of change over time, like evidence of test homogeneity, does not in itself provide information about how the construct relates to other constructs.
Evidence of pretest–posttest changes
Evidence that test scores change as a result of some experience between a pretest and a posttest can be evidence of construct validity. Some of the more typical intervening experiences responsible for changes in test scores are formal education, a course of therapy or medication, and on-the-job experience. Of course, depending on the construct being measured, almost any intervening life experience could be predicted to yield changes in score from pretest to posttest. Reading an inspirational book, watching a TV talk show, undergoing surgery, serving a prison sentence, or the mere passage of time may each prove to be a potent intervening variable.
Returning to our example of the Marital Satisfaction Scale, one investigator cited in Roach et al. (1981) compared scores on that instrument before and after a sex therapy treatment program. Scores showed a significant change between pretest and posttest. A second posttest given eight weeks later showed that scores remained stable (suggesting the instrument was reliable), whereas the pretest–posttest measures were still significantly different. Such changes in scores in the predicted direction after the treatment program contribute to evidence of the construct validity for this test.
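A pretest-posttest comparison of this kind is often evaluated with a t test for dependent (paired) scores. The sketch below shows only the arithmetic of that statistic; it is not a reconstruction of the analysis cited in Roach et al. (1981). The function name and the five pairs of scores are hypothetical, and the resulting t would still have to be compared against a critical value for the stated degrees of freedom.

```python
import math
from statistics import mean, stdev


def paired_t(pretest, posttest):
    """t statistic and degrees of freedom for paired pretest-posttest scores."""
    if len(pretest) != len(posttest) or len(pretest) < 2:
        raise ValueError("need at least two paired observations")
    # Work with each person's change score.
    diffs = [post - pre for pre, post in zip(pretest, posttest)]
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / standard_error, n - 1


# Hypothetical scale scores for five testtakers before and after treatment.
pretest = [10, 12, 9, 11, 13]
posttest = [14, 15, 12, 13, 16]
t_value, df = paired_t(pretest, posttest)
print(round(t_value, 2), df)  # prints 9.49 4
```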
JUST THINK . . .
Might it have been advisable to have simultaneous testing of a matched group of couples who did not participate in sex therapy and simultaneous testing of a matched group of couples who did not consult divorce attorneys? In both instances, would there have been any reason to expect any significant changes in the test scores of these two control groups?
We would expect a decline in marital satisfaction scores if a pretest were administered to a sample of couples shortly after they took their nuptial vows and a posttest were administered shortly after members of the couples consulted their respective divorce attorneys sometime within the first five years of marriage. The experimental group in this study would consist of couples who consulted a divorce attorney within the first five years of marriage. The design of such pretest–posttest research ideally should include a control group to rule out alternative explanations of the findings.
Evidence from distinct groups
One way of providing evidence for the validity of a test, also referred to as the method of contrasted groups, is to demonstrate that scores on the test vary in a predictable way as a function of membership in some group. The rationale here is that if a test is a valid measure of a particular construct, then groups of people who would be presumed to differ with respect to that construct should have correspondingly different test scores. Consider in this context a test of depression wherein the higher the test score, the more depressed the testtaker is presumed to be. We would expect individuals psychiatrically hospitalized for depression to score higher on this measure than a random sample of Walmart shoppers.
Now, suppose it was your intention to provide construct validity evidence for the Marital Satisfaction Scale by showing differences in scores between distinct groups. How might you go about doing that?
Roach and colleagues (1981) proceeded by identifying two groups of married couples, one relatively satisfied in their marriage, the other not so satisfied. The groups were identified by ratings by peers and professional marriage counselors. A t test on the difference between the groups’ mean scores on the test was significant (p < .01), which is evidence to support the notion that the Marital Satisfaction Scale is indeed a valid measure of the construct marital satisfaction.
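To show what a contrasted-groups comparison looks like numerically, the following sketch computes an independent-samples t statistic with a pooled variance estimate. The function name and both sets of scores are invented; the source does not report the raw data or say which form of the t test Roach and colleagues used, so the pooled-variance version is an assumption.

```python
import math
from statistics import mean, variance


def pooled_t(group_a, group_b):
    """Independent-samples t statistic (pooled variance) and its df."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    # Weighted average of the two sample variances.
    pooled_variance = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / df
    standard_error = math.sqrt(pooled_variance * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / standard_error, df


# Hypothetical scale scores for two contrasted groups.
satisfied = [80, 85, 90, 75, 95]
dissatisfied = [60, 55, 65, 70, 50]
t_value, df = pooled_t(satisfied, dissatisfied)
print(round(t_value, 2), df)  # prints 5.0 8
```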
In a bygone era, the method many test developers used to create distinct groups was deception. For example, if it had been predicted that more of the construct would be exhibited on the test in question if the subject felt highly anxious, an experimental situation might be designed to make the subject feel highly anxious. Virtually any feeling state the theory called for could be induced by an experimental scenario that typically involved giving the research subject some misinformation. However, given the ethical constraints of contemporary psychologists and the reluctance of academic institutions and other sponsors of research to condone deception in human research, the method of obtaining distinct groups by creating them through the dissemination of deceptive information is frowned upon (if not prohibited) today.
Evidence for the construct validity of a particular test may converge from a number of sources, such as other tests or measures designed to assess the same (or a similar) construct. Thus, if scores on the test undergoing construct validation tend to correlate highly in the predicted direction with scores on older, more established, and already validated tests designed to measure the same (or a similar) construct, this would be an example of convergent evidence.3
Convergent evidence for validity may come not only from correlations with tests purporting to measure an identical construct but also from correlations with measures purporting to measure related constructs. Consider, for example, a new test designed to measure the construct test anxiety. Generally speaking, we might expect high positive correlations between this new test and older, more established measures of test anxiety. However, we might also expect more moderate correlations between this new test and measures of general anxiety.
Roach et al. (1981) provided convergent evidence of the construct validity of the Marital Satisfaction Scale by computing a validity coefficient between scores on it and scores on the Marital Adjustment Test (Locke & Wallace, 1959). The validity coefficient of .79 provided additional evidence of their instrument’s construct validity.
A validity coefficient showing little relationship (that is, a statistically nonsignificant one) between scores on the test being construct-validated and scores on tests or other variables with which they should not, in theory, be correlated provides discriminant evidence of construct validity (also known as discriminant validity). In the course of developing the Marital Satisfaction Scale (MSS), its authors correlated scores on that instrument with scores on the Marlowe-Crowne Social Desirability Scale (Crowne & Marlowe, 1964). Roach et al. (1981) hypothesized that high correlations between these two instruments would suggest that respondents were probably not answering items on the MSS entirely honestly but instead were responding in socially desirable ways. But the correlation between the MSS and the social desirability measure did not prove to be significant, so the test developers concluded that social desirability could be ruled out as a primary factor in explaining the meaning of MSS test scores.
In 1959 an experimental technique useful for examining both convergent and discriminant validity evidence was presented in Psychological Bulletin. This rather technical procedure was called the multitrait-multimethod matrix. A detailed description of it, along with an illustration, can be found in OOBAL-6-B1. Here, let’s simply point out that multitrait means “two or more traits” and multimethod means “two or more methods.” The multitrait-multimethod matrix (Campbell & Fiske, 1959) is the matrix or table that results from correlating variables (traits) within and between methods. Values for any number of traits (such as aggressiveness or extraversion) as obtained by various methods (such as behavioral observation or a personality test) are inserted into the table, and the resulting matrix of correlations provides insight with respect to both the convergent and the discriminant validity of the methods used.4
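A toy version of such a matrix can be assembled in a few lines. In the sketch below, two traits are each measured by two methods for five people; every score is invented, and the variable names are hypothetical. Correlations between different methods measuring the same trait are collected as convergent evidence, and correlations between measures of different traits are collected as discriminant evidence. This is a simplified reading of the multitrait-multimethod logic, not a full implementation of Campbell and Fiske's criteria.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical scores for five people, keyed by (trait, method).
measures = {
    ("aggressiveness", "observation"): [1, 2, 3, 4, 5],
    ("aggressiveness", "self-report"): [2, 1, 4, 3, 5],
    ("extraversion", "observation"): [4, 1, 3, 5, 2],
    ("extraversion", "self-report"): [4, 1, 2, 5, 3],
}

convergent = {}    # same trait, different methods
discriminant = {}  # different traits, same or different method
for (key_a, scores_a), (key_b, scores_b) in combinations(measures.items(), 2):
    r = pearson_r(scores_a, scores_b)
    if key_a[0] == key_b[0]:
        convergent[(key_a, key_b)] = r
    else:
        discriminant[(key_a, key_b)] = r

for label, table in (("convergent", convergent), ("discriminant", discriminant)):
    for (key_a, key_b), r in table.items():
        print(label, key_a, key_b, round(r, 2))
```

With these made-up scores the same-trait correlations (.80 and .90) are much larger than any of the different-trait correlations (.00 to .20), which is the pattern one hopes to see.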
Both convergent and discriminant evidence of construct validity can be obtained by the use of factor analysis. Factor analysis is a shorthand term for a class of mathematical procedures designed to identify factors or specific variables that are typically attributes, characteristics, or dimensions on which people may differ. In psychometric research, factor analysis is frequently employed as a data reduction method in which several sets of scores and the correlations between them are analyzed. In such studies, the purpose of the factor analysis may be to identify the factor or factors in common between test scores on subscales within a particular test, or the factors in common between scores on a series of tests. In general, factor analysis is conducted on either an exploratory or a confirmatory basis. Exploratory factor analysis typically entails “estimating, or extracting factors; deciding how many factors to retain; and rotating factors to an interpretable orientation” (Floyd & Widaman, 1995, p. 287). By contrast, in confirmatory factor analysis, researchers test the degree to which a hypothetical model (which includes factors) fits the actual data.
A term commonly employed in factor analysis is factor loading, which is “a sort of metaphor. Each test is thought of as a vehicle carrying a certain amount of one or more abilities” (Tyler, 1965, p. 44). Factor loading in a test conveys information about the extent to which the factor determines the test score or scores. A new test purporting to measure bulimia, for example, can be factor-analyzed with other known measures of bulimia, as well as with other kinds of measures (such as measures of intelligence, self-esteem, general anxiety, anorexia, or perfectionism). High factor loadings by the new test on a “bulimia factor” would provide convergent evidence of construct validity. Moderate to low factor loadings by the new test with respect to measures of other eating disorders such as anorexia would provide discriminant evidence of construct validity.
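The idea of a loading can be illustrated with a deliberately simplified calculation. The sketch below does not perform a full factor analysis; as a stand-in, it extracts the first principal component of a small correlation matrix by power iteration and scales it by the square root of its eigenvalue, which gives each measure's loading on that component. The function name, the three measures, and their intercorrelations are all hypothetical.

```python
import math


def first_component_loadings(corr, iterations=200):
    """Loadings of each variable on the first principal component.

    corr: square correlation matrix given as a list of lists.
    Uses power iteration to approximate the leading eigenvector.
    """
    n = len(corr)
    vec = [1.0] * n
    for _ in range(iterations):
        product = [sum(corr[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in product))
        vec = [v / norm for v in product]
    # Rayleigh quotient gives the eigenvalue for the unit-length eigenvector.
    eigenvalue = sum(
        vec[i] * sum(corr[i][j] * vec[j] for j in range(n)) for i in range(n)
    )
    return [v * math.sqrt(eigenvalue) for v in vec]


# Hypothetical correlations among three measures, in this order:
# new bulimia test, established bulimia measure, intelligence test.
corr = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
loadings = first_component_loadings(corr)
print([round(value, 2) for value in loadings])
```

In this invented example the two bulimia measures load heavily on the shared component while the intelligence test loads only weakly, which parallels the convergent and discriminant pattern described above.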
Factor analysis frequently involves technical procedures so complex that few contemporary researchers would attempt to conduct one without the aid of sophisticated software. But although the actual data analysis has become work for computers, humans still tend to be very much involved in the naming of factors once the computer has identified them. Thus, for example, suppose a factor analysis identified a common factor being measured by two hypothetical instruments, a “Bulimia Test” and an “Anorexia Test.” This common factor would have to be named. One factor analyst looking at the data and the items of each test might christen the common factor an eating disorder factor. Another factor analyst examining exactly the same materials might label the common factor a body weight preoccupation factor. A third analyst might name the factor a self-perception disorder factor. Which of these is correct?
From a statistical perspective, it is simply impossible to say what the common factor should be named. Naming factors that emerge from a factor analysis has more to do with knowledge, judgment, and verbal abstraction ability than with mathematical expertise. There are no hard-and-fast rules. Factor analysts exercise their own judgment about what factor name best communicates the meaning of the factor. Further, even the criteria used to identify a common factor, as well as related technical matters, can be a matter of debate, if not heated controversy.
Factor analysis is a subject rich in technical complexity. Its uses and applications can vary as a function of the research objectives as well as the nature of the tests and the constructs under study. Factor analysis is the subject of our Close-Up in Chapter 9. More immediately, our Close-Up here brings together much of the information imparted so far in this chapter to provide a “real life” example of the test validation process.
The Preliminary Validation of a Measure of Individual Differences in Constructive Versus Unconstructive Worry*
Establishing validity is an important step in the development of new psychological measures. The development of a questionnaire that measures individual differences in worry called the Constructive and Unconstructive Worry Questionnaire (CUWQ; McNeill & Dunlop, 2016) provides an illustration of some of the steps in the test validation process.
Prior to the development of this questionnaire, research on worry had shown that the act of worrying can lead to both positive outcomes (such as increased work performance; Perkins & Corr, 2005) and negative outcomes (such as insomnia; Carney & Waters, 2006). Importantly, findings suggested that the types of worrying thoughts that lead to positive outcomes (which are referred to by the test authors as constructive worry) may differ from the types of worrying thoughts that lead to negative outcomes (referred to as unconstructive worry). However, a review of existing measures of individual differences in worry suggested that none of the measures were made to distinguish people’s tendency to worry constructively from their tendency to worry unconstructively. Since the ability to determine whether individuals are predominantly worrying constructively or unconstructively holds diagnostic and therapeutic benefits, the test authors set out to fill this gap and develop a new questionnaire that would be able to capture both these dimensions of the worry construct.
During the first step of questionnaire development, the creation of an item pool, it was important to ensure the questionnaire would have good content validity. That is, the items would need to adequately sample the variety of characteristics of constructive and unconstructive worry. Based on the test authors’ definition of these two constructs, a literature review was conducted and a list of potential characteristics of constructive versus unconstructive worry was created. This list of characteristics was used to develop a pool of 40 items. These 40 items were cross-checked by each author, as well as one independent expert, to ensure that each item was unique and concise. A review of the list as a whole was conducted to ensure that it covered the full range of characteristics identified by the literature review. This process resulted in the elimination of 11 of the initial items, leaving a pool of 29 items. Of the 29 items in total, 13 items were expected to measure the tendency to worry constructively, and the remaining 16 items were expected to measure the tendency to worry unconstructively.
Next, drawing from the theoretical background behind the test authors’ definition of constructive and unconstructive worry, a range of criteria that should be differentially related to one’s tendency to worry constructively versus unconstructively were selected. More specifically, it was hypothesized that the tendency to worry unconstructively would be positively related to trait-anxiety (State Trait Anxiety Inventory (STAI-T); Spielberger et al., 1970) and amount of worry one experiences (e.g., Worry Domains Questionnaire (WDQ); Stöber & Joormann, 2001). In addition, this tendency to worry unconstructively was hypothesized to be negatively related to one’s tendency to be punctual and one’s actual performance of risk-mitigating behaviors. The tendency to worry constructively, on the other hand, was hypothesized to be negatively related to trait-anxiety and amount of worry, and positively related to one’s tendency to be punctual and one’s performance of risk-mitigating behaviors. Identification of these criteria prior to data collection would pave the way for the test authors to conduct an evaluation of the questionnaire’s criterion-based construct-validity in the future.
Upon completion of item pool construction and criterion identification, two studies were conducted. In Study 1, data from 295 participants from the United States was collected on the 29 newly developed worry items, plus two criterion-based measures, namely trait-anxiety and punctuality. An exploratory factor analysis was conducted, and the majority of the 29 items grouped together into a two-factor solution (as expected). The items predicted to capture a tendency to worry constructively loaded strongly on one factor, and the items predicted to capture a tendency to worry unconstructively loaded strongly on the other factor. However, 11 out of the original 29 items either did not load strongly on either factor, or they cross-loaded onto the other factor to a moderate extent. To increase construct validity through increased homogeneity of the two scales, these 11 items were removed from the final version of the questionnaire. The 18 items that remained included eight that primarily loaded on the factor labeled as constructive worry and ten that primarily loaded on the factor labeled as unconstructive worry.
A confirmatory factor analysis on these 18 items showed a good model fit. However, this analysis does not prove that these two factors actually captured the tendencies to worry constructively and unconstructively. To test the construct validity of these factor scores, the relations of the unconstructive and constructive worry factors with both trait-anxiety (Spielberger et al., 1970) and the tendency to be punctual were examined. Results supported the hypotheses and supported an assumption of criterion-based construct validity. That is, as hypothesized, scores on the constructive worry factor were negatively associated with trait-anxiety and positively associated with the tendency to be punctual. Scores on the unconstructive worry factor were positively associated with trait-anxiety and negatively associated with the tendency to be punctual.
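The sign checks described for Study 1 can be sketched in a few lines of code. This is only an illustration of the logic, not the test authors' analysis: the scores below are made up for six imaginary respondents, and the names `pearson_r`, `scores`, and `sign_pattern_supported` are hypothetical. A real validation study would also examine the size and statistical significance of each correlation, not just its direction.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Made-up scores for six imaginary respondents (NOT data from the CUWQ studies).
scores = {
    "constructive":   [12, 15, 20, 22, 27, 30],
    "unconstructive": [31, 28, 24, 19, 15, 11],
    "trait_anxiety":  [48, 45, 40, 33, 30, 25],
    "punctuality":    [2, 3, 3, 4, 5, 5],
}

# Hypothesized direction of each relation, as stated in the Close-Up (+1 or -1).
hypothesized_sign = {
    ("constructive", "trait_anxiety"): -1,
    ("constructive", "punctuality"): +1,
    ("unconstructive", "trait_anxiety"): +1,
    ("unconstructive", "punctuality"): -1,
}

def sign_pattern_supported(scores, hypothesized_sign):
    """Map each (scale, criterion) pair to True if r has the hypothesized sign."""
    outcome = {}
    for (scale, criterion), sign in hypothesized_sign.items():
        r = pearson_r(scores[scale], scores[criterion])
        outcome[(scale, criterion)] = (r * sign) > 0
    return outcome

supported = sign_pattern_supported(scores, hypothesized_sign)
for pair, ok in supported.items():
    print(pair, "supported" if ok else "not supported")
```

Stating the expected signs before looking at the data, as the test authors did, is what gives such a check its evidential value.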
To further test the construct validity of this newly developed measure, a second study was conducted. In Study 2, 998 Australian residents of wildfire-prone areas responded to the 18 (final) worry items from Study 1, plus additional items capturing two additional criteria. These two additional criteria were (1) the amount of worry one tends to experience as captured by two existing worry questionnaires, namely the Worry Domains Questionnaire (Stöber & Joormann, 2001) and the Penn State Worry Questionnaire (Meyer et al., 1990), and (2) the performance of risk-mitigating behaviors that reduce the risk of harm or property damage resulting from a potential wildfire threat. A confirmatory factor analysis on this second data set supported the notion that constructive worry versus unconstructive worry items were indeed capturing separate constructs in a homogeneous manner. Furthermore, as hypothesized, the constructive worry factor was positively associated with the performance of wildfire risk-mitigating behaviors, and negatively associated with the amount of worry one experiences. The unconstructive worry factor, on the other hand, was negatively associated with the performance of wildfire risk-mitigating behaviors, and positively associated with the amount of worry one experiences. This provided further evidence of criterion-based construct validity.
There are several ways in which future studies could provide additional evidence of construct validity of the CUWQ. For one, both studies reported above looked at the two scales’ concurrent criterion-based validity, but not at their predictive criterion-based validity. Future studies could focus on filling this gap. For example, since both constructs are hypothesized to predict the experience of anxiety (which was confirmed by the scales’ relationships with trait-anxiety in Study 1), they should predict the likelihood of an individual being diagnosed with an anxiety disorder in the future, with unconstructive worry being a positive predictor and constructive worry being a negative predictor. Furthermore, future studies could provide additional evidence of construct validity by testing whether interventions, such as therapy aimed at reducing unconstructive worry, can lead to a reduction in scores on the unconstructive worry scale over time. Finally, it is important to note that all validity testing to date has been conducted in samples from the general population, so the test should be further tested in samples from a clinical population of pathological worriers before test validity in this population can be assumed. The same applies to the use of the questionnaire in samples from non-US/Australian populations.
*This Close-Up was guest-authored by Ilona M. McNeill of The University of Melbourne, and Patrick D. Dunlop of The University of Western Australia.
JUST THINK . . .
What might be an example of a valid test used in an unfair manner?
Validity, Bias, and Fairness
In the eyes of many laypeople, questions concerning the validity of a test are intimately tied to questions concerning the fair use of tests and the issues of bias and fairness. Let us hasten to point out that validity, fairness in test use, and test bias are three separate issues. It is possible, for example, for a valid test to be used fairly or unfairly.
For the general public, the term bias as applied to psychological and educational tests may conjure up many meanings having to do with prejudice and preferential treatment (Brown et al., 1999). For federal judges, the term bias as it relates to items on children’s intelligence tests is synonymous with “too difficult for one group as compared to another” (Sattler, 1991). For psychometricians, bias is a factor inherent in a test that systematically prevents accurate, impartial measurement.
Psychometricians have developed the technical means to identify and remedy bias, at least in the mathematical sense. As a simple illustration, consider a test we will call the “flip-coin test” (FCT). The “equipment” needed to conduct this test is a two-sided coin. One side (“heads”) has the image of a profile and the other side (“tails”) does not. The FCT would be considered biased if the instrument (the coin) were weighted so that either heads or tails appears more frequently than by chance alone. If the test in question were an intelligence test, the test would be considered biased if it were constructed so that people who had brown eyes consistently and systematically obtained higher scores than people with green eyes—assuming, of course, that in reality people with brown eyes are not generally more intelligent than people with green eyes. Systematic is a key word in our definition of test bias. We have previously looked at sources of random or chance variation in test scores. Bias implies systematic variation.
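To make the flip-coin illustration concrete, the sketch below computes an exact two-sided binomial probability for a given number of heads under the assumption that the coin is fair. The function name `two_sided_binomial_p` and the flip counts are hypothetical and chosen only for illustration. A small probability suggests that the departure from an even split is systematic rather than the kind of random variation a fair coin produces.

```python
from math import comb

def two_sided_binomial_p(heads, flips):
    """Exact two-sided p-value for observing `heads` in `flips` of a fair coin.

    The fair-coin distribution is symmetric, so the two-sided value is taken
    as twice the smaller tail probability, capped at 1.
    """
    total = 2 ** flips
    upper = sum(comb(flips, k) for k in range(heads, flips + 1)) / total
    lower = sum(comb(flips, k) for k in range(0, heads + 1)) / total
    return min(1.0, 2 * min(upper, lower))

# Hypothetical outcomes of 100 flips each.
print(two_sided_binomial_p(70, 100))  # 70 heads: hard to attribute to chance
print(two_sided_binomial_p(52, 100))  # 52 heads: well within chance variation
```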
Another illustration: Let’s suppose we need to hire 50 secretaries and so we place an ad in the newspaper. In response to the ad, 200 people reply, including 100 people who happen to have brown eyes and 100 people who happen to have green eyes. Each of the 200 applicants is individually administered a hypothetical test we will call the “Test of Secretarial Skills” (TSS). Logic tells us that eye color is probably not a relevant variable with respect to performing the duties of a secretary. We would therefore have no reason to believe that green-eyed people are better secretaries than brown-eyed people or vice versa. We might reasonably expect that, after the tests have been scored and the selection process has been completed, an approximately equivalent number of brown-eyed and green-eyed people would have been hired (or, approximately 25 brown-eyed people and 25 green-eyed people). But what if it turned out that 48 green-eyed people were hired and only 2 brown-eyed people were hired? Is this evidence that the TSS is a biased test?
Although the answer to this question seems simple on the face of it (“Yes, the test is biased because they should have hired 25 and 25!”), a truly responsible answer to this question would entail statistically troubleshooting the test and the entire selection procedure (see Berk, 1982). One reason some tests have been found to be biased has more to do with the design of the research study than the design of the test. For example, if there are too few testtakers in one of the groups (such as the minority group, literally), this methodological problem will make it appear as if the test is biased when in fact it may not be. A test may justifiably be deemed biased if some portion of its variance stems from some factor(s) that are irrelevant to performance on the criterion measure; as a consequence, one group of testtakers will systematically perform differently from another. Prevention during test development is the best cure for test bias, though a procedure called estimated true score transformations represents one of many available post hoc remedies (Mueller, 1949; see also Reynolds & Brown, 1984).
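One small piece of that statistical troubleshooting can be sketched as follows, using the numbers from the TSS example (200 applicants, 100 in each eye-color group, 50 hires). The function name `prob_at_most` is hypothetical, and the model assumes that hires are drawn at random with respect to eye color. The calculation shows only how improbable a 48-to-2 split would be under that assumption; it cannot say whether the cause lies in the test, the criterion, or some other part of the selection procedure.

```python
from math import comb

def prob_at_most(k_max, group_size, other_size, hires):
    """P(at most k_max hires come from the first group) when `hires` people
    are drawn at random, without replacement, from both groups combined
    (a hypergeometric model)."""
    total = comb(group_size + other_size, hires)
    ways = sum(
        comb(group_size, k) * comb(other_size, hires - k)
        for k in range(0, k_max + 1)
    )
    return ways / total

# Numbers from the TSS example: 100 brown-eyed and 100 green-eyed applicants.
expected_brown = 50 * 100 / 200            # hires expected if eye color is irrelevant
p_extreme = prob_at_most(2, 100, 100, 50)  # chance of 2 or fewer brown-eyed hires

print(expected_brown)
print(p_extreme)
```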
A rating is a numerical or verbal judgment (or both) that places a person or an attribute along a continuum identified by a scale of numerical or word descriptors known as a rating scale . Simply stated, a rating error is a judgment resulting from the intentional or unintentional misuse of a rating scale. Thus, for example, a leniency error (also known as a generosity error ) is, as its name implies, an error in rating that arises from the tendency on the part of the rater to be lenient in scoring, marking, and/or grading. From your own experience during course registration, you might be aware that a section of a particular course will quickly be filled if it is being taught by a professor with a reputation for leniency errors in end-of-term grading. As another possible example of a leniency or generosity error, consider comments in the “Twittersphere” after a high-profile performance of a popular performer. Intuitively, one would expect more favorable (and forgiving) ratings of the performance from die-hard fans of the performer, regardless of the actual quality of the performance as rated by more objective reviewers. The phenomenon of leniency and severity in ratings can be found in almost any setting in which ratings are rendered. In psychotherapy settings, for example, it is not unheard of for supervisors to be a bit too generous or too lenient in their ratings of their supervisees.
Reviewing the literature on psychotherapy supervision and supervision in other disciplines, Gonsalvez and Crowe (2014) concluded that raters’ judgments of psychotherapy supervisees’ competency are compromised by leniency errors. In an effort to remedy the state of affairs, they offered a series of concrete suggestions including a list of specific competencies to be evaluated, as well as when and how such evaluations for competency should be conducted.
JUST THINK . . .
What factor do you think might account for the phenomenon of raters whose ratings always seem to fall victim to the central tendency error?
At the other extreme is a severity error . Movie critics who pan just about everything they review may be guilty of severity errors. Of course, that is only true if they review a wide range of movies that might consensually be viewed as good and bad.
Another type of error might be termed a central tendency error . Here the rater, for whatever reason, exhibits a general and systematic reluctance to give ratings at either the positive or the negative extreme. Consequently, all of this rater’s ratings would tend to cluster in the middle of the rating continuum.
One way to overcome what might be termed restriction-of-range rating errors (central tendency, leniency, severity errors) is to use rankings , a procedure that requires the rater to measure individuals against one another instead of against an absolute scale. By using rankings instead of ratings, the rater (now the “ranker”) is forced to select first, second, third choices, and so forth.
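The three restriction-of-range patterns can be told apart by where a rater's ratings cluster and how little they vary. The sketch below assumes a hypothetical 7-point scale, made-up ratings, and illustrative cutoffs (`shift` and `narrow`) that are not established standards. Note also that tightly clustered ratings signal a rating error only if the ratees themselves vary; a rater who reviews uniformly good work is not being lenient.

```python
from statistics import mean, pstdev

SCALE_MIN, SCALE_MAX = 1, 7               # hypothetical 7-point rating scale
MIDPOINT = (SCALE_MIN + SCALE_MAX) / 2

def describe_rater(ratings, shift=1.5, narrow=1.0):
    """Label a rater's pattern using illustrative (not standard) cutoffs."""
    avg, spread = mean(ratings), pstdev(ratings)
    if spread >= narrow:
        return "no restriction of range"
    if avg - MIDPOINT >= shift:
        return "possible leniency"
    if MIDPOINT - avg >= shift:
        return "possible severity"
    return "possible central tendency"

# Made-up ratings of the same six ratees by four imaginary raters.
ratings_by_rater = {
    "Rater A": [6, 7, 6, 7, 6, 7],
    "Rater B": [1, 2, 1, 2, 2, 1],
    "Rater C": [4, 4, 3, 4, 5, 4],
    "Rater D": [1, 4, 7, 2, 6, 5],
}

labels = {name: describe_rater(r) for name, r in ratings_by_rater.items()}
for name, label in labels.items():
    print(name, "->", label)
```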
Halo effect describes the fact that, for some raters, some ratees can do no wrong. More specifically, a halo effect may also be defined as a tendency to give a particular ratee a higher rating than he or she objectively deserves because of the rater’s failure to discriminate among conceptually distinct and potentially independent aspects of a ratee’s behavior. Just for the sake of example—and not for a moment because we believe it is even in the realm of possibility—let’s suppose Lady Gaga consented to write and deliver a speech on multivariate analysis. Her speech probably would earn much higher all-around ratings if given before the founding chapter of the Lady Gaga Fan Club than if delivered before and rated by the membership of, say, the Royal Statistical Society. This would be true even in the highly improbable case that the members of each group were equally savvy with respect to multivariate analysis. We would expect the halo effect to be operative at full power as Lady Gaga spoke before her diehard fans.
Criterion data may also be influenced by the rater’s knowledge of the ratee’s race or sex (Landy & Farr, 1980). Males have been shown to receive more favorable evaluations than females in traditionally masculine occupations. Except in highly integrated situations, ratees tend to receive higher ratings from raters of the same race (Landy & Farr, 1980). Returning to our hypothetical Test of Secretarial Skills (TSS) example, a particular rater may have had particularly great—or particularly distressing—prior experiences with green-eyed (or brown-eyed) people and so may be making extraordinarily high (or low) ratings on that irrational basis.
Training programs to familiarize raters with common rating errors and sources of rater bias have shown promise in reducing rating errors and increasing measures of reliability and validity. Lecture, role playing, discussion, watching oneself on videotape, and computer simulation of different situations are some of the many techniques that could be brought to bear in such training programs. We revisit the subject of rating and rating error in our discussion of personality assessment later. For now, let’s take up the issue of test fairness.
In contrast to questions of test bias, which may be thought of as technically complex statistical problems, issues of test fairness tend to be rooted more in thorny issues involving values (Halpern, 2000). Thus, although questions of test bias can sometimes be answered with mathematical precision and finality, questions of fairness can be grappled with endlessly by well-meaning people who hold opposing points of view. With that caveat in mind, and with exceptions most certainly in the offing, we will define fairness in a psychometric context as the extent to which a test is used in an impartial, just, and equitable way.
Some uses of tests are patently unfair in the judgment of any reasonable person. During the cold war, the government of what was then called the Soviet Union used psychiatric tests to suppress political dissidents. People were imprisoned or institutionalized for verbalizing opposition to the government. Apart from such blatantly unfair uses of tests, what constitutes a fair and an unfair use of tests is a matter left to various parties in the assessment enterprise. Ideally, the test developer strives for fairness in the test development process and in the test’s manual and usage guidelines. The test user strives for fairness in the way the test is actually used. Society strives for fairness in test use by means of legislation, judicial decisions, and administrative regulations.
Fairness as applied to tests is a difficult and complicated subject. However, it is possible to discuss some rather common misunderstandings regarding what are sometimes perceived as unfair or even biased tests. Some tests, for example, have been labeled “unfair” because they discriminate among groups of people. The reasoning here goes something like this: “Although individual differences exist, it is a truism that all people are created equal. Accordingly, any differences found among groups of people on any psychological trait must be an artifact of an unfair or biased test.” Because this belief is rooted in faith as opposed to scientific evidence (in fact, it flies in the face of scientific evidence), it is virtually impossible to refute. One either accepts it on faith or does not.
We would all like to believe that people are equal in every way and that all people are capable of rising to the same heights given equal opportunity. A more realistic view would appear to be that each person is capable of fulfilling a personal potential. Because people differ so obviously with respect to physical traits, one would be hard put to believe that psychological differences found to exist between individuals, and groups of individuals, are purely a function of inadequate tests. Again, although a test is not inherently unfair or biased simply because it is a tool by which group differences are found, the use of the test data, like the use of any data, can be unfair.
Another misunderstanding of what constitutes an unfair or biased test is that it is unfair to administer to a particular population a standardized test that did not include members of that population in the standardization sample. In fact, the test may well be biased, but that must be determined by statistical or other means. The sheer fact that no members of a particular group were included in the standardization sample does not in itself invalidate the test for use with that group.
A final source of misunderstanding is the complex problem of remedying situations where bias or unfair test usage has been found to occur. In the area of selection for jobs, positions in universities and professional schools, and the like, a number of different preventive measures and remedies have been attempted. As you read about the tools used in these attempts in this chapter’s Everyday Psychometrics , form your own opinions regarding what constitutes a fair use of employment and other tests in a selection process.
Adjustment of Test Scores by Group Membership: Fairness in Testing or Foul Play?
Any test, regardless of its psychometric soundness, may be knowingly or unwittingly used in a way that has an adverse impact on one or another group. If such adverse impact is found to exist and if social policy demands some remedy or an affirmative action program, then psychometricians have a number of techniques at their disposal to create change. Table 1 lists some of these techniques.
Table 1. Psychometric Techniques for Preventing or Remedying Adverse Impact and/or Instituting an Affirmative Action Program
Some of these techniques may be preventive if employed in the test development process, and others may be employed with already established tests. Some of these techniques entail direct score manipulation; others, such as banding, do not. Preparation of this table benefited from Sackett and Wilk (1994), and their work should be consulted for more detailed consideration of the complex issues involved.
|Technique||Description|
|Addition of Points||A constant number of points is added to the test score of members of a particular group. The purpose of the point addition is to reduce or eliminate observed differences between groups.|
|Differential Scoring of Items||This technique incorporates group membership information, not in adjusting a raw score on a test but in deriving the score in the first place. The application of the technique may involve the scoring of some test items for members of one group but not scoring the same test items for members of another group. This technique is also known as empirical keying by group.|
|Elimination of Items Based on Differential Item Functioning||This procedure entails removing from a test any items found to inappropriately favor one group’s test performance over another’s. Ideally, the intent of the elimination of certain test items is not to make the test easier for any group but simply to make the test fairer. Sackett and Wilk (1994) put it this way: “Conceptually, rather than asking ‘Is this item harder for members of Group X than it is for Group Y?’ these approaches ask ‘Is this item harder for members of Group X with true score Z than it is for members of Group Y with true score Z?’”|
|Differential Cutoffs||Different cutoffs are set for members of different groups. For example, a passing score for members of one group is 65, whereas a passing score for members of another group is 70. As with the addition of points, the purpose of differential cutoffs is to reduce or eliminate observed differences between groups.|
|Separate Lists||Different lists of testtaker scores are established by group membership. For each list, test performance of testtakers is ranked in top-down fashion. Users of the test scores for selection purposes may alternate selections from the different lists. Depending on factors such as the allocation rules in effect and the equivalency of the standard deviation within the groups, the separate-lists technique may yield effects similar to those of other techniques, such as the addition of points and differential cutoffs. In practice, the separate list is popular in affirmative action programs where the intent is to overselect from previously excluded groups.|
|Within-Group Norming||Used as a remedy for adverse impact if members of different groups tend to perform differentially on a particular test, within-group norming entails the conversion of all raw scores into percentile scores or standard scores based on the test performance of one’s own group. In essence, an individual testtaker is being compared only with other members of his or her own group. When race is the primary criterion of group membership and separate norms are established by race, this technique is known as race-norming.|
|Banding||The effect of banding of test scores is to make equivalent all scores that fall within a particular range or band. For example, thousands of raw scores on a test may be transformed to a stanine having a value of 1 to 9. All scores that fall within each of the stanine boundaries will be treated by the test user as either equivalent or subject to some additional selection criteria. A sliding band (Cascio et al., 1991) is a modified banding procedure wherein a band is adjusted (“slid”) to permit the selection of more members of some group than would otherwise be selected.|
|Preference Policies||In the interest of affirmative action, reverse discrimination, or some other policy deemed to be in the interest of society at large, a test user might establish a policy of preference based on group membership. For example, if a municipal fire department sought to increase the representation of female personnel in its ranks, it might institute a test-related policy designed to do just that. A key provision in this policy might be that when a male and a female earn equal scores on the test used for hiring, the female will be hired.|
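Two of the techniques in the table, within-group norming and banding, come down to simple arithmetic, which the sketch below makes concrete. It is shown only to illustrate the computations, not as a recommended practice. The group labels, raw scores, band width, and function names are hypothetical, and the percentile-rank definition used (percentage of same-group scores at or below a score) is one of several in common use. Fixed-width bands stand in here for the stanine example, which would require boundaries based on the score distribution.

```python
def within_group_percentiles(scores_by_group):
    """Convert each raw score to a percentile rank within its own group.

    Percentile rank here = percentage of same-group scores at or below the
    score (an assumption of this sketch; other definitions exist).
    """
    result = {}
    for group, scores in scores_by_group.items():
        n = len(scores)
        result[group] = [
            100 * sum(other <= s for other in scores) / n for s in scores
        ]
    return result

def band(score, width=5):
    """Fixed-width banding: scores with the same band number are treated
    as equivalent by the test user."""
    return score // width

# Made-up raw scores for two hypothetical groups.
raw = {"Group X": [50, 60, 70, 80], "Group Y": [60, 70, 80, 90]}
norms = within_group_percentiles(raw)

# The same raw score of 70 receives a different within-group percentile rank.
print(norms["Group X"][2], norms["Group Y"][1])

# Raw scores of 71 and 74 share a band; 74 and 75 do not.
print(band(71), band(74), band(75))
```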
Although psychometricians have the tools to institute special policies through manipulations in test development, scoring, and interpretation, there are few clear guidelines in this controversial area (Brown, 1994; Gottfredson, 1994, 2000; Sackett & Wilk, 1994). The waters are further muddied by the fact that some of the guidelines seem to have contradictory implications. For example, although racial preferment in employee selection (disparate treatment) is unlawful, the use of valid and unbiased selection procedures virtually guarantees disparate impact. This state of affairs will change only when racial disparities in job-related skills and abilities are minimized (Gottfredson, 1994).
In 1991, Congress enacted legislation effectively barring employers from adjusting testtakers’ scores for the purpose of making hiring or promotion decisions. Section 106 of the Civil Rights Act of 1991 made it illegal for employers “in connection with the selection or referral of applicants or candidates for employment or promotion to adjust the scores of, use different cutoffs for, or otherwise alter the results of employment-related tests on the basis of race, color, religion, sex, or national origin.”
The law prompted concern on the part of many psychologists who believed it would adversely affect various societal groups and might reverse social gains. Brown (1994, p. 927) forecast that “the ramifications of the Act are more far-reaching than Congress envisioned when it considered the amendment and could mean that many personality tests and physical ability tests that rely on separate scoring for men and women are outlawed in employment selection.” Arguments in favor of group-related test-score adjustment have been made on philosophical as well as technical grounds. From a philosophical perspective, increased minority representation is socially valued to the point that minority preference in test scoring is warranted. In the same vein, minority preference is viewed both as a remedy for past societal wrongs and as a contemporary guarantee of proportional workplace representation. From a more technical perspective, it is argued that some tests require adjustment in scores because (1) the tests are biased, and a given score on them does not necessarily carry the same meaning for all testtakers; and/or (2) “a particular way of using a test is at odds with an espoused position as to what constitutes fair use” (Sackett & Wilk, 1994, p. 931).
In contrast to advocates of test-score adjustment are those who view such adjustments as part of a social agenda for preferential treatment of certain groups. These opponents of test-score adjustment reject the subordination of individual effort and ability to group membership as criteria in the assignment of test scores (Gottfredson, 1988, 2000). Hunter and Schmidt (1976, p. 1069) described the unfortunate consequences for all parties involved in a college selection situation wherein poor-risk applicants were accepted on the basis of score adjustments or quotas. With reference to the employment setting, Hunter and Schmidt (1976) described one case in which entrance standards were lowered so more members of a particular group could be hired. However, many of these new hires did not pass promotion tests—with the result that the company was sued for discriminatory promotion practice. Yet another consideration concerns the feelings of “minority applicants who are selected under a quota system but who also would have been selected under unqualified individualism and must therefore pay the price, in lowered prestige and self-esteem” (Jensen, 1980, p. 398).
A number of psychometric models of fairness in testing have been presented and debated in the scholarly literature (Hunter & Schmidt, 1976; Petersen & Novick, 1976; Schmidt & Hunter, 1974; Thorndike, 1971). Despite a wealth of research and debate, a long-standing question in the field of personnel psychology remains: “How can group differences on cognitive ability tests be reduced while retaining existing high levels of reliability and criterion-related validity?”
According to Gottfredson (1994), the answer probably will not come from measurement-related research because differences in scores on many of the tests in question arise principally from differences in job-related abilities. For Gottfredson (1994, p. 963), “the biggest contribution personnel psychologists can make in the long run may be to insist collectively and candidly that their measurement tools are neither the cause of nor the cure for racial differences in job skills and consequent inequalities in employment.”
Beyond the workplace and personnel psychology, what role, if any, should measurement play in promoting diversity? As Haidt et al. (2003) reflected, there are several varieties of diversity, some perceived as more valuable than others. Do we need to develop more specific measures designed, for example, to discourage “moral diversity” while encouraging “demographic diversity”? These types of questions have implications in a number of areas from academic admission policies to immigration.
JUST THINK . . .
How do you feel about the use of various procedures to adjust test scores on the basis of group membership? Are these types of issues best left to measurement experts?
If performance differences are found between identified groups of people on a valid and reliable test used for selection purposes, some hard questions may have to be dealt with if the test is to continue to be used. Is the problem due to some technical deficiency in the test, or is the test in reality too good at identifying people of different levels of ability? Regardless, is the test being used fairly? If so, what might society do to remedy the skill disparity between different groups as reflected on the test?
Our discussion of issues of test fairness and test bias may seem to have brought us far afield of the seemingly cut-and-dried, relatively nonemotional subject of test validity. However, the complex issues accompanying discussions of test validity, including issues of fairness and bias, must be wrestled with by us all. For further consideration of the philosophical issues involved, we refer you to the solitude of your own thoughts and the reading of your own conscience.
Test your understanding of elements of this chapter by seeing if you can explain each of the following terms, expressions, and abbreviations:
· base rate
· central tendency error
· concurrent validity
· confirmatory factor analysis
· construct validity
· content validity
· convergent evidence
· convergent validity
· criterion contamination
· criterion-related validity
· discriminant evidence
· expectancy chart
· expectancy data
· exploratory factor analysis
· face validity
· factor analysis
· factor loading
· false negative
· false positive
· generosity error
· halo effect
· hit rate
· incremental validity
· intercept bias
· leniency error
· local validation study
· method of contrasted groups
· miss rate
· multitrait-multimethod matrix
· predictive validity
· rating error
· rating scale
· severity error
· slope bias
· test blueprint
· validation study
· validity coefficient