When a person's test performance can be compared with that of a representative pretested group, the test is said to be?

Computerized Test Construction

W.J. van der Linden, in International Encyclopedia of the Social & Behavioral Sciences, 2001

In computerized test construction, a combination of items is selected from an item bank that is optimal in some sense and satisfies a set of constraints representing the content specifications for the test. Formally, this problem has the structure of a constrained combinatorial optimization problem in which an objective function is optimized subject to a set of constraints, both typically modeled using 0–1 decision variables for the inclusion of the items in the test. Though different kinds of objective functions can be formulated, the most popular one matches the information function of the test to a set of target values. Three different kinds of constraints are discussed, namely constraints on: [a] categorical item attributes; [b] quantitative item attributes; and [c] logical relations between the items in the test. Applications of the methodology to the problems of assembling tests with a linear, sequential, and adaptive format are reviewed.
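To make this formulation concrete, here is a minimal sketch [not part of the original article; the item bank, target values, and constraints below are hypothetical] of a 0–1 test assembly problem solved by brute force: the objective matches the test information function to target values at a few ability points, subject to a test-length constraint and a categorical content constraint.

```python
# Minimal sketch of 0-1 test assembly (illustrative only; the item bank and
# target values are hypothetical, not taken from the article).
from itertools import combinations
import math

# Hypothetical item bank: 2PL discrimination a, difficulty b, content category.
bank = [
    {"a": 1.2, "b": -1.0, "cat": "algebra"},
    {"a": 0.8, "b": -0.5, "cat": "algebra"},
    {"a": 1.5, "b":  0.0, "cat": "geometry"},
    {"a": 1.0, "b":  0.3, "cat": "geometry"},
    {"a": 1.3, "b":  0.8, "cat": "algebra"},
    {"a": 0.9, "b":  1.2, "cat": "geometry"},
    {"a": 1.1, "b": -0.2, "cat": "algebra"},
    {"a": 1.4, "b":  0.5, "cat": "geometry"},
]

def info(item, theta):
    """Fisher information of a 2PL item at ability theta."""
    p = 1.0 / (1.0 + math.exp(-item["a"] * (theta - item["b"])))
    return item["a"] ** 2 * p * (1.0 - p)

theta_points = [-1.0, 0.0, 1.0]   # ability points where information is targeted
targets = [1.0, 1.5, 1.0]         # hypothetical target information values
length = 4                        # test-length constraint
min_per_cat = 2                   # categorical constraint: >= 2 items per category

best, best_dev = None, float("inf")
for combo in combinations(range(len(bank)), length):   # enumerate 0-1 solutions
    cats = [bank[i]["cat"] for i in combo]
    if any(cats.count(c) < min_per_cat for c in {"algebra", "geometry"}):
        continue                                        # violates the content constraint
    # Objective: total absolute deviation of test information from the targets.
    dev = sum(abs(sum(info(bank[i], t) for i in combo) - tgt)
              for t, tgt in zip(theta_points, targets))
    if dev < best_dev:
        best, best_dev = combo, dev

print("selected items:", best, "total deviation from targets:", round(best_dev, 3))
```

In practice the same model is solved with integer-programming software rather than enumeration; the brute-force loop here only serves to show the role of the 0–1 decision variables, the objective, and the constraints.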

Read full chapter

URL: //www.sciencedirect.com/science/article/pii/B0080430767006549

Computerized Test Construction

Bernard P. Veldkamp, in International Encyclopedia of the Social & Behavioral Sciences [Second Edition], 2015

Robust Computerized Test Construction

In all these test construction models, it is assumed that the item parameters and item attributes are known. Fixed values are used to compute the contribution of each item to both the objective function and the constraints. Unfortunately, many of these values are only estimates. IRT parameters, for example, are estimated when the items are pretested, and the uncertainty in these estimates is reflected by their standard errors of estimation. This uncertainty is generally not taken into account. Hambleton and Jones [1994] already warned about the consequences: when the objective of test construction is to maximize the information in the test and the uncertainty in the item parameters is ignored, the amount of information in the resulting test will be seriously overestimated, especially when the item parameters have been estimated from small samples or when relatively few items are selected. Since the amount of information in the test is inversely related to the variance of the ability estimate, this overestimation has serious consequences for the measurement precision and the reliability of the test.

De Jong et al. [2009] proposed a robust test construction algorithm to deal with these problems. Inspired by the work of Soyster [1973], they subtracted one standard error of estimation from the parameter estimates and formulated a robust test construction model as

[26] \quad \max \sum_{i=1}^{I} I_i(\theta_j, \zeta)\, x_i \qquad \text{[objective function]}

where ζ denotes the level of uncertainty, and

[27] \quad I_i(\theta_j, \zeta) = I_i(\theta_j) - \mathrm{SE}\left(I_i(\theta_j)\right).

For small values of the standard error of estimation, the results for the test construction problem in Eqns [5]–[10] and its robust counterpart are almost the same, but for large values, a considerable difference will be obtained.
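As a small numerical sketch of this adjustment [the item values below are hypothetical, not taken from the chapter], the fragment compares item selection by nominal information with selection by the robust value I_i(θ_j) − SE(I_i(θ_j)) from Eqn [27].

```python
# Sketch of robust item selection: subtract one SE of the information estimate
# before maximizing, per Eqn [27]. Item values are hypothetical.
items = [
    # (item id, estimated information at theta_j, SE of that estimate)
    ("item1", 0.90, 0.40),
    ("item2", 0.85, 0.05),
    ("item3", 0.80, 0.30),
    ("item4", 0.78, 0.02),
    ("item5", 0.75, 0.25),
]
n_select = 3

naive = sorted(items, key=lambda it: it[1], reverse=True)[:n_select]
robust = sorted(items, key=lambda it: it[1] - it[2], reverse=True)[:n_select]

print("naive pick :", [it[0] for it in naive],
      "nominal info =", round(sum(it[1] for it in naive), 2))
print("robust pick:", [it[0] for it in robust],
      "robust info  =", round(sum(it[1] - it[2] for it in robust), 2))
# With large SEs the naive selection favors items whose information is likely
# overestimated; the robust objective guards against this optimism.
```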

Read full chapter

URL: //www.sciencedirect.com/science/article/pii/B9780080970868430199

Item Response Theory

Wim J. van der Linden, in Encyclopedia of Social Measurement, 2005

Item Banking

Traditionally, test construction went through a cycle in which a set of items was written that best represented the variable. The set was typically somewhat larger than needed, to allow for some items to fail an empirical pretest. The set was then pretested and the item statistics were analyzed. The final step was to put the test together using the item statistics. If a new test was needed, the same cycle of item writing, pretesting, and item analysis was repeated.

In item banking, items are written, pretested, and calibrated on a continuous basis. Items that pass the quality criteria are added to a pool of items from which new tests are assembled when needed. The advantage of item banking is not only a more efficient use of the resources, but also tests of higher quality assembled from a larger stock of pretested items, less vulnerability to the consequences of possible item security breaches, and stable scales defined by larger pools of items.

In item banking, problems of item calibration and person scoring always involve large amounts of missing data, because no person responds to all items in the pool. IRT deals with these problems in a natural way. When new items are calibrated, a few existing items from the pool are added to the tests and their known item parameter values are used to link the new items to the scale already established for the pool. Likewise, if a person is tested, any set of items from the pool can be used as a test. When estimating the person's score, the estimation equation has known values for the item parameters and automatically accounts for the properties of the items. Consequently, in more statistical terms, differences in item selection do not create any [asymptotic] bias in the ability estimates.
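As a hedged illustration of scoring from an arbitrary calibrated subset [the 2PL model and the parameter values below are assumptions made for the sketch, not taken from the chapter], the fragment estimates a person's ability by maximum likelihood using only the items that person was administered, with the item parameters treated as known from calibration.

```python
# Sketch: ML ability estimation under a 2PL model, using only the subset of
# calibrated items a person was administered. Parameter values are hypothetical.
import math

pool = {                       # item id -> (discrimination a, difficulty b)
    "A": (1.2, -0.8), "B": (0.9, 0.0), "C": (1.5, 0.6),
    "D": (1.1, 1.2),  "E": (0.7, -1.5),
}
responses = {"B": 1, "C": 0, "D": 1}    # this person saw only items B, C, D

def p_correct(a, b, theta):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta):
    ll = 0.0
    for item, u in responses.items():
        a, b = pool[item]
        p = p_correct(a, b, theta)
        ll += u * math.log(p) + (1 - u) * math.log(1.0 - p)
    return ll

# Crude grid search for the maximum-likelihood estimate of theta.
grid = [i / 100.0 for i in range(-400, 401)]
theta_hat = max(grid, key=log_likelihood)
print("ML ability estimate:", theta_hat)
```

Because the item parameters enter the likelihood as known constants, the same scoring routine works for any subset of items drawn from the calibrated pool.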

Read full chapter

URL: //www.sciencedirect.com/science/article/pii/B0123693985004527

Development of a Scientific Test: A Practical Guide

Michael C. Ramsay, Cecil R. Reynolds, in Handbook of Psychological Assessment [Third Edition], 2000

INTRODUCTION

Most authors portray test construction as a matter of carefully composing groups of items, administering them to a representative sample of people, and analyzing the responses using established statistical techniques. Many writers [e.g., Allen & Yen, 1979; Anstey, 1966; Kline, 1986; Robertson, 1990] lay out steps for the prospective test developer to follow. They often look something like this [adapted from Allen & Yen, 1979]:

1. Develop a plan to cover the desired content.
2. Design items that fit the plan.
3. Conduct a trial administration.
4. Analyze the results, and modify the test if needed.
5. Administer the test again.
6. Repeat as necessary, beginning with Step 2 or 4.

The ensuing pages will, to a large extent, cajole and bludgeon the reader to follow this same time-tested trail to reliability and validity. Recent trends in test development, however, suggest that constructing a test requires more than close attention to the test itself. As growing numbers of psychologists and psychometricians tiptoe across the lines that once divided them, test developers are increasingly turning to mainstream research to test, and even to provide a basis for, the measures they develop.

The viewpoint that a good test should have a basis in empirical research, not theory alone [e.g., Reynolds & Bigler, 1995a, 1995b; Reynolds & Kamphaus, 1992a, 1992b], has gained in popularity. At a minimum, most test constructors would agree that a well-designed experiment can help explain the characteristics measured by a test [Embretson, 1985]. Inevitably, such constructs as aptitude and schizophrenia do not respond readily to laboratory controls. Furthermore, obvious ethical constraints prevent scientists from manipulating certain variables, such as suicidal tendencies, and from holding others constant, such as learning rate. These considerations should restrain the influence of experimentalism on testing. Still, the foundations of testing have subtly shifted, and to some degree, the content of this chapter reflects this shift.

Theory, too, has played an important role in psychological test development. Before devising the ground-breaking Metrical Scale of Intelligence with Theodore Simon, Alfred Binet spent many years building and refining a concept of intelligence [Anastasi, 1986; Binet & Simon, 1916/1980]. Soon after Binet and Simon released their scale, Charles Spearman [1927; 1923/1973; Spearman & Jones, 1950] presented a model of intelligence derived from correlational studies. Spearman [1904] posited four kinds of intelligence: present efficiency, native capacity, common sense, and the impression made on other people. Modern psychologists might recast these constructs loosely as achievement, ability, practical intelligence, and social intelligence. Thus, Spearman’s model begins to resemble recent theories that include practical, tacit, and social intelligence [Gardner, 1983; Sternberg, 1985, 1990; Sternberg & Wagner, 1986]. Notably, however, Spearman’s model appears to omit creativity. Other contributors to the theory and modeling of mental ability include Guilford [1967, 1977], Cattell [1971], Thurstone [1938; Thurstone & Thurstone, 1941], Luria [1962/1980, 1966, 1972], Das, Kirby, and Jarman [1979], and Kaufman and Kaufman [1983a, 1983b, 1983c]. Personality testing, too, has a diverse theoretical base. Personality theories linked with testing include the big five personality factors [Costa & McCrae, 1992a, 1992b; John, 1990; Norman, 1963], trait or disposition theory [Mischel, 1990; Zeidner, 1995], Guilford’s 14 personality dimensions [Guilford & Zimmerman, 1956], and Murray’s manifest need system [Anastasi, 1988; Murray, 1938]. The role of theory in test development becomes important in construct definition, Step 2 below.

Read full chapter

URL: //www.sciencedirect.com/science/article/pii/B978008043645650080X

Test Development

S.M. Downing, in International Encyclopedia of Education [Third Edition], 2010

Introduction

Test development or test construction refers to the science and art of planning, preparing, administering, scoring, statistically analyzing, and reporting results of tests. This article emphasizes a systematic process used to develop tests in order to maximize validity evidence for scores resulting from those tests.

The point of view of this article is large-scale, high-stakes cognitive testing, scored using classical measurement theory [CMT], which also provides an adequate model for most classroom tests. The selected-response test item format, exemplified by the prototypic multiple-choice item, is an excellent format for testing student cognitive achievement at most levels of schooling and at all levels of the cognitive taxonomy. The 12-step process model is also useful for other testing formats, including performance-type tests using item formats such as the constructed-response form, simulations, structured oral examinations, and all other blended or combination formats.

This test development process model applies equally well to nearly all types of cognitive tests, including those intended to measure achievement or ability, tests used for academic or job placement, tests intended to diagnose learners' strengths and limitations, tests intended to provide learners with formative feedback on progress as well as a final summative measure of cognitive achievement in a domain, and occupational-type tests used to credential individuals for a profession or occupation.

The emphasis on some steps will depend on the exact purpose of the test being developed. Some steps or tasks may be subsumed into other tasks or dismissed as unnecessary for a particular testing situation. This process model, summarized in Table 1, may be reordered or reorganized, given special requirements for the proposed test, but most tasks will be required to be carried out for most tests, whatever format is used.

Table 1. Twelve-step process model for test development

Step | Example test-development tasks
1. Complete plan | Systematic plan to guide all other test-development activities: clear statement of test's purpose and definition of the construct; desired test-score interpretations or inferences; planned validity evidence; rationale for test format[s]; psychometric model; timelines; security; quality control plan
2. Definition of content | Exact sampling plan relating items to domain; rational or empirical foundation; major source of content-related validity evidence
3. Test specifications | Content operationally defined and specifically related to Step 2; documents content definition; provides content-related validity evidence for score inferences to domain
4. Developing items | Creation of effective test stimuli or item formats; principles of effective item writing; item writer training; content and editorial review to reduce bias
5. Test design and assembly | Designing and operationalizing test forms; item or prompt selection; operational sampling by planned blueprint; pretesting, item tryout
6. Producing test forms | Publishing or producing tests; security, validity, quality control issues
7. Administration | Validity issues concerned with standardization; fairness issues; proctoring, security, timing issues
8. Response scoring | Validity issues; psychometric model; quality control; key validation; item analysis
9. Establishing passing scores | Establishing defensible cut or passing scores; relative vs. absolute methods; validity issues; comparability of standards and constancy of score scale
10. Reporting test results | Validity issues: accuracy, quality control; timely; misuse issues
11. Item banking | Effective item banking; security; flexibility
12. Documentation – technical report | Process documentation: thorough, detailed documentation for validity evidence

Reproduced [in modified form] from Downing, S. M. [2006]. Twelve steps for effective test development. In Downing, S. M. and Haladyna, T. M. [eds]. Handbook of Test Development, pp 3–25. Mahwah, NJ: Lawrence Erlbaum Associates, Copyright 2006, with permission of Lawrence Erlbaum Associates [Taylor & Francis Group, LLC].
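Step 8 of Table 1 lists key validation and item analysis under a classical measurement model. As a minimal, hypothetical sketch [not part of the chapter], the fragment below computes two standard classical item statistics from a small scored response matrix: item difficulty [the proportion correct] and a corrected item-total correlation as a discrimination index.

```python
# Sketch of classical item analysis: difficulty (p-value) and corrected
# item-total correlation. The scored responses below are hypothetical.
from statistics import mean, pstdev

# rows = examinees, columns = items, 1 = correct, 0 = incorrect
scores = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]

def pearson(x, y):
    """Population Pearson correlation between two equal-length sequences."""
    mx, my, sx, sy = mean(x), mean(y), pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return float("nan")
    return mean((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)

n_items = len(scores[0])
for j in range(n_items):
    item = [row[j] for row in scores]
    rest = [sum(row) - row[j] for row in scores]   # total score excluding item j
    difficulty = mean(item)                        # proportion correct
    discrimination = pearson(item, rest)           # corrected item-total r
    print(f"item {j + 1}: p = {difficulty:.2f}, r_it = {discrimination:.2f}")
```

Flagging items with very extreme difficulty or low item-total correlations is one routine quality-control check in Step 8 before scores are finalized.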

Read full chapter

URL: //www.sciencedirect.com/science/article/pii/B9780080448947002700

Optimal Test Construction

Bernard P. Veldkamp, in Encyclopedia of Social Measurement, 2005

Conclusion and Discussion

The main issue in optimal test construction is how to formulate a test assembly model. Models for a number of optimal test construction problems have been introduced. However, all these models are based on the general test construction model in Eqs. [14]. They may need a different objective function, some additional constraints, or different definitions of the decision variables, but the structure of the model remains the same. When different optimal test construction problems have to be solved, the question is not how to find a new method but how to define an appropriate objective function and a set of constraints.

The weighted deviations model is an alternative to the linear programming model. All the models presented in the third section could also be written as weighted deviation models. Whether linear programming models or weighted deviation models should be applied depends on the nature of the specifications. When the specifications have to be met, linear programming models are more suitable, but when the specifications are less strict, the weighted deviations model can be used. In practical testing situations, it may even happen that a combination of the two models can be applied if only some of the specifications have to be met. The same algorithms and heuristics can solve both the linear programming models and the weighted deviation models.
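As a hedged sketch of this contrast [the item bank, weights, and the simplified penalty form below are assumptions, not the published weighted deviations model itself], the fragment solves a tiny assembly problem twice: once with the content specification treated as a hard constraint, and once with deviations from the specification penalized in the objective.

```python
# Sketch contrasting hard (LP-style) constraints with a weighted-deviations
# objective on a tiny hypothetical item bank. Brute force over all length-3 tests.
from itertools import combinations

bank = [  # (information at the target theta, content category)
    (0.9, "A"), (0.8, "A"), (0.7, "B"), (0.6, "B"), (0.5, "C"),
]
length, spec = 3, {"A": 1, "B": 1, "C": 1}   # specification: one item per category
weights = {"A": 2.0, "B": 2.0, "C": 0.2}     # category C may be relaxed cheaply

def counts(combo):
    c = {k: 0 for k in spec}
    for i in combo:
        c[bank[i][1]] += 1
    return c

def info(combo):
    return sum(bank[i][0] for i in combo)

# Hard-constraint (LP-style) solution: only tests meeting the spec are feasible.
feasible = [c for c in combinations(range(len(bank)), length)
            if all(counts(c)[k] >= v for k, v in spec.items())]
lp_best = max(feasible, key=info)

# Weighted-deviations solution: every test is allowed, shortfalls are penalized.
def penalty(combo):
    c = counts(combo)
    return sum(weights[k] * max(0, v - c[k]) for k, v in spec.items())

wd_best = max(combinations(range(len(bank)), length),
              key=lambda c: info(c) - penalty(c))

print("hard-constraint pick:", lp_best, "info:", round(info(lp_best), 2))
print("weighted-dev pick   :", wd_best, "info - penalty:",
      round(info(wd_best) - penalty(wd_best), 2))
```

With a small weight on category C, the weighted-deviations pick trades the strict specification for extra information, which is exactly the behavior wanted when specifications are preferences rather than requirements.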

Finally, some remarks have to be made about the quality of optimal test construction methods. The models, algorithms, and heuristics presented here are very effective in constructing optimal tests. Additional gains are possible by improving the quality of the item pool. Some efforts have already been made to develop an optimal blueprint for item pool design that would combine test specifications and optimal test construction methods to develop better item pools. This will further increase measurement precision.

Read full chapter

URL: //www.sciencedirect.com/science/article/pii/B0123693985004473

Fairness

C. Gipps, G. Stobart, in International Encyclopedia of Education [Third Edition], 2010

Bias

Bias is a technical term used in test construction. In layman’s terms, a test is biased if “two individuals with equal ability [in the subject being tested] but from different groups do not have the same probability of success” [Shepard et al., 1981].

A main cause of bias in a test would be that it was designed by one cultural group to reflect their own experience, and thus disadvantage test-takers from other cultural groups, as in the example of IQ tests given above. Thus, bias may be due to the content matter in a test. However, it may also be due to lack of clarity in instructions which leads to differential responses from different groups, or to scoring systems that do not credit appropriate or correct responses that are more typical of one group than another.

The existence of group differences in performances on tests is often taken to imply that the tests are biased, the assumption being that one group is not inherently less able than the other. However, as we have argued, the two groups may well have been subject to different environmental experiences or unequal access to the curriculum. This difference will be reflected in group test scores, but the test is not strictly speaking biased.

There are technical approaches to evaluating bias, which we discuss below.
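Those technical approaches fall outside this excerpt; as one hedged illustration consistent with Shepard's definition [the data and group labels below are invented], the sketch compares an item's correct-response rates for two groups after matching examinees on their score on the remaining items, which is the basic logic behind Mantel–Haenszel-type checks for differential item functioning.

```python
# Sketch of a simple bias/DIF screen: compare an item's correct-response rates
# for two groups after matching on the rest-of-test score. Data are hypothetical.
from collections import defaultdict

# (group, total score on the other items, 1 = item answered correctly)
records = [
    ("group1", 5, 1), ("group1", 5, 1), ("group1", 3, 0), ("group1", 3, 1),
    ("group1", 7, 1), ("group2", 5, 0), ("group2", 5, 1), ("group2", 3, 0),
    ("group2", 3, 0), ("group2", 7, 1),
]

strata = defaultdict(lambda: {"group1": [0, 0], "group2": [0, 0]})  # [correct, total]
for group, score, correct in records:
    cell = strata[score][group]
    cell[0] += correct
    cell[1] += 1

for score in sorted(strata):
    g1_correct, g1_n = strata[score]["group1"]
    g2_correct, g2_n = strata[score]["group2"]
    p1 = g1_correct / g1_n if g1_n else float("nan")
    p2 = g2_correct / g2_n if g2_n else float("nan")
    # Equal-ability examinees (same matched score) should succeed at similar rates;
    # a consistent gap across strata flags the item for review.
    print(f"matched score {score}: group1 p = {p1:.2f}, group2 p = {p2:.2f}")
```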

Read full chapter

URL: //www.sciencedirect.com/science/article/pii/B978008044894700244X

Interest Inventories

Jo-lda C. Hansen Ph.D., in Handbook of Psychological Assessment [Third Edition], 2000

Occupational Scales

The Occupational Scales of the Strong are another example of test construction using the empirical method of contrast groups. The response-rate percentage of the occupational criterion sample to each item is compared to the response-rate percentage of the appropriate-sex contrast sample [i.e., General Reference Sample of females or males] to identify items that differentiate the two samples. Usually 30 to 50 items are identified as the interests [“Likes”] or the aversions [“Dislikes”] of each occupational criterion sample. The raw scores for an individual scored on the Occupational Scales are converted to standard scores based on the occupational criterion sample, with a mean of 50 and a standard deviation of 10 [see Figure 9.6].

Figure 9.6. Profile for the General Occupational Themes and the Basic Interest Scales of the Strong.

Modified and reproduced by special permission of the Publisher, Consulting Psychologists Press, Inc., Palo Alto, CA 94303 from the Strong Interest Inventory™ of the Strong Vocational Interest Blanks® Form T317. Copyright 1933, 1938, 1945, 1946, 1966, 1968, 1974, 1981, 1985, 1994 by The Board of Trustees of the Leland Stanford Junior University. All rights reserved. Printed under license from Stanford University Press, Stanford, CA 94305. Further reproduction is prohibited without the Publisher's written consent. Strong Interest Inventory is a trademark and Strong Vocational Interest Blanks is a registered trademark of the Stanford University Press.
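A minimal sketch of this contrast-group construction [the endorsement rates, retention threshold, norm values, and keyed scoring below are hypothetical simplifications, not the Strong's actual keys]: items whose "Like" rates differ sufficiently between the occupational criterion sample and the reference sample are retained, and a raw score on the resulting scale is converted to a standard score with a mean of 50 and a standard deviation of 10.

```python
# Sketch of empirical contrast-group scale construction and standard-score
# conversion. Endorsement rates and norm values below are hypothetical.
items = {
    # item: (criterion-sample "Like" %, reference-sample "Like" %)
    "repair machinery": (72, 35),
    "write poetry":     (20, 45),
    "lead a meeting":   (50, 48),
    "work outdoors":    (80, 40),
    "balance accounts": (30, 55),
}
threshold = 16   # minimum percentage-point difference to keep an item

# Items that differentiate the criterion sample from the reference sample,
# keyed as interests (+1) or aversions (-1) of the criterion sample.
scale = {item: (1 if crit > ref else -1)
         for item, (crit, ref) in items.items()
         if abs(crit - ref) >= threshold}
print("scale items and keys:", scale)

def raw_score(likes):
    """Simplified keyed scoring of the items a respondent marks 'Like'."""
    return sum(key if item in likes else -key for item, key in scale.items())

# Convert to a standard score (mean 50, SD 10) using the criterion sample's
# hypothetical raw-score mean and standard deviation.
crit_mean, crit_sd = 1.0, 2.5
def standard_score(raw):
    return 50 + 10 * (raw - crit_mean) / crit_sd

likes = {"repair machinery", "work outdoors"}
print("standard score:", round(standard_score(raw_score(likes)), 1))
```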

For most occupations, matched-sex scales are presented on the Strong Interest Inventory profile. However, seven of the 109 occupations [211 Scales] are represented by just one scale [e.g., f Child Care Provider, f Dental Assistant, f Dental Hygienist, f Home Economics Teacher, f Secretary, m Agribusiness Manager, and m Plumber].

Read full chapter

URL: //www.sciencedirect.com/science/article/pii/B9780080436456500872

Achievement Tests

Gregory J. Cizek, in Encyclopedia of Applied Psychology, 2004

4 Achievement Test Construction

Rigorous achievement test development consists of numerous common steps. Achievement test construction differs slightly based on whether the focus of the assessment is classroom use or larger scale. Table I provides a sequence listing 18 steps that would be common to most achievement test development.

TABLE I. Common Steps in Achievement Test Development

1. Establish need, purpose
2. Delineate domain to be tested
3. Develop specific objectives, content standards
4. Decide on item and test specifications, formats, length, costs
5. Develop items, tasks, scoring guides
6. Conduct item/task review [editorial, appropriateness, alignment, sensitivity]
7. Pilot/Field test items/tasks/scoring guides
8. Review item/task performance
9. Create item bank/pool
10. Assemble test form[s] according to specifications
11. Develop test administration guidelines, materials
12. Establish performance standards
13. Administer operational test forms
14. Score test
15. Evaluate preliminary item/task and examinee performance
16. Report scores to appropriate audiences
17. Evaluate test, document test cycle
18. Update item pool, revise development procedures, develop new items/tasks

In both large and smaller contexts, the test maker would begin with specification of a clear purpose for the test or battery and a careful delineation of the domain to be sampled. Following this, the specific standards or objectives to be tested are developed. If it is a classroom achievement test, the objectives may be derived from a textbook, an instructional unit, a school district curriculum guide, content standards, or another source. Larger scale achievement tests [e.g., state mandated, standards referenced] would begin the test development process with reference to adopted state content standards. Standardized norm-referenced instruments would ordinarily be based on large-scale curriculum reviews, on analysis of content standards adopted in various states, or on standards promulgated by content-area professional associations. Licensure, certification, or other credentialing tests would seek a foundation in a job analysis or a survey of practitioners in the particular occupation. Regardless of the context, these first steps, which ground the test in content standards, curriculum, or professional practice, provide an important foundation for the validity of eventual test score interpretations.

Common next steps would include deciding on and developing appropriate items or tasks and related scoring guides to be field tested prior to actual administration of the test. At this stage, test developers pay particular attention to characteristics of items and tasks [e.g., clarity, discriminating power, amenability to dependable scoring] that will promote reliability of eventual scores obtained by examinees on the operational test.

Following item/task tryout in field testing, a database of acceptable items or tasks, called an item bank or item pool, would be created. From this pool, operational test forms would be drawn to match previously decided test specifications. Additional steps would be required, depending on whether the test is to be administered via paper-and-pencil format or computer. Ancillary materials, such as administrator guides and examinee information materials, would also be produced and distributed in advance of test administration. Following test administration, an evaluation of testing procedures and test item/task performance would be conducted. If scores on the current test form must be comparable to scores from a previous test administration, statistical procedures for equating the two test forms would be carried out. Once quality assurance procedures have ensured accuracy of test results, scores for examinees would be reported to individual test takers and other groups as appropriate. Finally, documentation of the entire process would be gathered and refinements would be made prior to cycling back through the steps to develop subsequent test forms [Steps 5–18].
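The chapter does not specify an equating method; as one common approach shown purely for illustration [score distributions below are hypothetical], the sketch applies linear, mean-sigma equating, which places scores from a new form onto the scale of a previous form by matching score means and standard deviations.

```python
# Sketch of linear (mean-sigma) equating of a new test form X onto the scale of
# a previously administered form Y. Score distributions are hypothetical.
from statistics import mean, pstdev

form_x_scores = [18, 22, 25, 27, 30, 33, 35, 38, 41, 44]   # new form
form_y_scores = [20, 24, 26, 29, 31, 34, 37, 40, 42, 46]   # reference form

mx, sx = mean(form_x_scores), pstdev(form_x_scores)
my, sy = mean(form_y_scores), pstdev(form_y_scores)

def equate(x):
    """Linear transformation matching form X's mean/SD to form Y's."""
    return (sy / sx) * (x - mx) + my

for x in (25, 35):
    print(f"form X score {x} is reported as {equate(x):.1f} on the form Y scale")
```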

Read full chapter

URL: //www.sciencedirect.com/science/article/pii/B0126574103002269

Structured Clinical Interviews for Adults

Arthur N. Wiens, Patricia J. Brazil, in Handbook of Psychological Assessment [Third Edition], 2000

Computer Development

Computers have long played a significant role in assessment. Much modern test construction has been dependent on the availability of computing resources. As test administration itself became more feasible with the advent of microcomputers, one of the questions raised was the comparability of data obtained with traditional paper-and-pencil administration and computerized administration. Lukin, Dowd, Plake, & Kraft [1985] found no significant differences between scores on measures of anxiety, depression, and psychological reactance across administration format. Most important, while producing comparable results, the computerized administration was preferred over paper-and-pencil administration by 85 percent of the subjects.

Ferriter [1993] compared three interview methods for collecting information for psychiatric social histories: human unstructured interviewing, human structured interviewing, and the same structured interview delivered by computer. He concluded that structured interviewing collected significantly more information than unstructured interviewing. A comparison of structured human and computer interviews showed greater extremes of response with fewer discrepancies of fact in the computer condition, indicating greater candidness of subjects in that group, and therefore, greater validity of data collected by computer.

Choca & Morris [1992] compared a computerized version of the Halstead Category Test to the standard projector version of the test using neurologically impaired adult patients. Every patient was tested with both versions and the order of administration was alternated. Results indicated that the difference in mean number of errors made between the two versions of the test was not significant. The scores obtained with the two versions were seen as similar to what would be expected from a test-retest administration of the same instrument. The authors note that one advantage of the computerized version is that it assures an error-free administration of the test. Secondly, the computer version allows the collection of additional data as the test is administered, such as the reaction time and the number of perseverations when a previous rule is inappropriately used. Finally, it may eventually be possible to show that promptings from the examiner do not make a significant difference in terms of the eventual outcome. If this were the case, the computer version would have the added advantage of requiring a considerably smaller time commitment by the examiner [Choca & Morris, 1992, pp. 11–12].

These studies, and others not reviewed here, support the contention that computerized testing techniques provide results comparable to traditional assessment techniques when using individual tests.

While psychological software has not kept pace with hardware development, the growth in the availability of new programs of interest to psychologists and other clinicians has been dramatic. Samuel E. Krug [1987] compiled a product listing that included more than 300 programs designed to assess or modify human behavior. Of these listings, eight percent were categorized as structured interviews and were most likely to be described as intake interviews. Krug [1993] subsequently compiled a similar product listing that now includes more than 500 programs designed to assess or modify human behavior. Of these listings, about 11 percent are categorized as structured interviews. The products in this category almost always are designed to be self-administered.

One of the earlier proponents of automated computer interviewing, John H. Greist [1984], observed that clinician training, recent experience, immediate distractions, and foibles of memory are among the factors that may compromise our competence as diagnosticians. Further, he stated that in virtually every instance in which computer interviews and clinicians have been compared, the computer outperforms the clinician in terms of completeness and accuracy. Erdman, Klein, & Greist [1985] suggest that one appeal of computer interviewing is the ability of the computer to imitate, even if only to a limited degree, the intelligence of a human interviewer. Like a human interviewer, the computer can be programmed to ask follow-up questions for problems that the respondent reports, and to skip follow-up questions in those areas of no problem. This branching capability leads to an interaction between computer and human; that is, what happens in the interview depends on what the subject says. Thus, a computer interview can be tailored to the person using the program, for example, not to ask a male subject about pregnancy. Of more interest, though, is the capacity to ask follow-up questions in the subject’s own words and to compare responses from different points in the interview. While it has been asserted that computers cannot detect flat affect, Erdman, Klein, & Greist [1985] do note that it is possible to record response latency and heart rate simultaneously and to use these variables to branch into questions/comments regarding emotional arousal. It does seem clear, however, that to date it has not been possible for the computer to report the many nonverbal cues that a human interviewer could observe and respond to. To be candid, however, it must be acknowledged that a human interviewer also remains oblivious to a great deal of information available in a two-person interaction.
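A minimal sketch of the branching logic described above [the questions and data structure are invented for illustration]: follow-up questions are attached to each screening question and asked only when the respondent reports a problem, so the interview adapts to what the subject says.

```python
# Sketch of a branching computer interview: follow-up questions are asked only
# when a problem is reported. Questions and structure are hypothetical.
interview = [
    {"ask": "Have you had trouble sleeping in the past month? (y/n) ",
     "followups": ["How many nights per week? ",
                   "Does worry keep you awake? (y/n) "]},
    {"ask": "Have you had any changes in appetite? (y/n) ",
     "followups": ["Have you gained or lost weight? "]},
]

def run(answer_fn=input):
    """Run the interview; answer_fn is injected so the sketch can be tested."""
    record = []
    for item in interview:
        reply = answer_fn(item["ask"]).strip().lower()
        record.append((item["ask"], reply))
        if reply.startswith("y"):                 # branch only when a problem is reported
            for follow in item["followups"]:
                record.append((follow, answer_fn(follow).strip()))
    return record

if __name__ == "__main__":
    # Scripted answers stand in for a live respondent in this sketch.
    scripted = iter(["y", "3", "y", "n"])
    for question, answer in run(lambda prompt: next(scripted)):
        print(question, "->", answer)
```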

Read full chapter

URL: //www.sciencedirect.com/science/article/pii/B9780080436456500938

Which concept measures the extent to which a test yields consistent results?

Reliability is the degree to which an assessment tool produces stable and consistent results. Test-retest reliability is a measure of reliability obtained by administering the same test twice over a period of time to a group of individuals.

Which of the following terms refers to a person's ability to reason speedily and abstractly?

Fluid intelligence refers to a person's ability to reason speedily and abstractly; it tends to decline with age.

What is a condition in which a person otherwise limited in mental ability has an exceptional specific skill, such as in computation or drawing?

Savant syndrome is a condition in which a person otherwise limited in mental ability has an exceptional specific skill, such as in computation or drawing.

What is the widespread improvement in intelligence test performance during the past century called?

The Flynn effect refers to the consistent upward drift in IQ test scores across generations, which has been documented at approximately 3 points per decade.
