What is more important: validity or reliability in assessment?

Validity and reliability of assessment methods are considered the two most important characteristics of a well-designed assessment procedure.

Validity refers to the degree to which a method assesses what it claims or intends to assess. The different types of validity include:

  • content validity: the assessment method matches the content of the work
  • criterion validity: results from the assessment correlate with particular behaviours (an external criterion)
  • construct validity: scores reflect the underlying construct the items are intended to measure.5,13

Performance-based assessments are typically viewed as providing more valid data than traditional examinations because they focus more directly on the tasks or skills of practice.2

Reliability refers to the extent to which an assessment method or instrument consistently measures a student's performance. Assessments are usually expected to produce comparable outcomes, with consistent standards over time and between different learners and examiners. However, the following factors impede both the validity and reliability of assessment practices in workplace settings:

  • inconsistent nature of people
  • reliance on assessors to make judgements without bias
  • changing contexts/conditions
  • evidence of achievement arising spontaneously or incidentally.2,13

Explicit performance criteria enhance both the validity and reliability of the assessment process.  Clear, usable assessment criteria contribute to the openness and accountability of the whole process.  The context, tasks and behaviours desired are specified so that assessment can be repeated and used for different individuals.  Explicit criteria also counter criticisms of subjectivity.13


1. What is reliability?

Reliability refers to whether an assessment instrument gives the same results each time it is used in the same setting with the same type of subjects. Reliability essentially means consistent or dependable results. Reliability is a part of the assessment of validity.

2. What is validity?

Validity in research refers to how accurately a study answers the study question or the strength of the study conclusions. For outcome measures such as surveys or tests, validity refers to the accuracy of measurement. Here validity refers to how well the assessment tool actually measures the underlying outcome of interest. Validity is not a property of the tool itself, but rather of the interpretation or specific purpose of the assessment tool with particular settings and learners.

Assessment instruments must be both reliable and valid for study results to be credible. Thus, reliability and validity must be examined and reported, or references cited, for each assessment instrument used to measure study outcomes. Examples of assessments include resident feedback surveys, course evaluations, written tests, clinical simulation observer ratings, needs assessment surveys, and teacher evaluations. Using an instrument with high reliability is not sufficient; other measures of validity are needed to establish the credibility of your study.

3. How is reliability measured?

Reliability can be estimated in several ways; the method will depend upon the type of assessment instrument. Reliability evidence is sometimes described as part of the internal structure of the assessment tool.

For internal consistency, 2 to 3 questions or items that measure the same concept are created, and the correlation among the answers is calculated.

Cronbach alpha is a test of internal consistency that is frequently used to estimate the correlation among the answers on an assessment tool. Cronbach alpha calculates the correlation among all the variables, in every combination; an estimate close to 1 indicates high reliability.
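As a concrete illustration, Cronbach alpha can be computed directly from a matrix of item scores. The following is a minimal Python sketch, not a definitive implementation; the function name and the sample data are hypothetical.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach alpha for an (n_subjects, n_items) matrix of item scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of the summed scores
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

# Hypothetical example: 5 residents answering 3 items intended to
# measure the same concept (e.g., satisfaction on a 1-5 scale).
scores = [[4, 5, 4],
          [2, 3, 2],
          [5, 5, 4],
          [3, 3, 3],
          [4, 4, 5]]
print(round(cronbach_alpha(scores), 2))  # ~0.91 here; values near 1 indicate high internal consistency
```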

For test/retest reliability, the test should give the same results each time, assuming there are no interval changes in what you are measuring; agreement between the administrations is often measured as a correlation, with Pearson r.

Test/retest is a more conservative estimate of reliability than Cronbach alpha, but it takes at least 2 administrations of the tool, whereas Cronbach alpha can be calculated after a single administration. To perform a test/retest, you must be able to minimize or eliminate any change (ie, learning) in the condition you are measuring, between the 2 measurement times. Administer the assessment instrument at 2 separate times for each subject and calculate the correlation between the 2 different measurements.
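In code, a test/retest estimate is simply the correlation between the two administrations. A minimal sketch using scipy, with invented scores for the same five subjects at two times:

```python
from scipy.stats import pearsonr

# Hypothetical scores for the same 5 subjects at two administrations,
# with no training or other change in between.
time1 = [72, 85, 90, 64, 78]
time2 = [70, 88, 91, 66, 75]

r, p_value = pearsonr(time1, time2)
print(round(r, 2))  # an r close to 1 suggests stable, reproducible scores
```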

Interrater reliability is used to study the effect of different raters or observers using the same tool and is generally estimated by percent agreement, kappa (for binary outcomes), or Kendall tau.
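For two raters scoring the same cases on a binary outcome, percent agreement and kappa can be computed as in the sketch below. The ratings are invented for illustration; the kappa calculation uses scikit-learn's cohen_kappa_score.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical pass (1) / fail (0) judgments by two raters on the same 8 cases.
rater_a = np.array([1, 0, 1, 1, 0, 1, 0, 0])
rater_b = np.array([1, 0, 1, 0, 0, 1, 0, 1])

percent_agreement = (rater_a == rater_b).mean()  # raw agreement; ignores chance
kappa = cohen_kappa_score(rater_a, rater_b)      # agreement corrected for chance
print(percent_agreement, round(kappa, 2))        # 0.75 and 0.5 for these data
```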

Another method uses analysis of variance (ANOVA) to generate a generalizability coefficient, to quantify how much measurement error can be attributed to each potential factor, such as different test items, subjects, raters, dates of administration, and so forth. This model looks at the overall reliability of the results.
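A full generalizability study is beyond a short example, but the simplest ANOVA-based coefficient, a one-way intraclass correlation, illustrates the idea of partitioning score variance between subjects and everything else (raters plus error). The sketch below is an assumption-laden simplification with hypothetical data, where each of n subjects is scored by the same number of raters:

```python
import numpy as np

def icc_oneway(ratings):
    """One-way random-effects intraclass correlation, ICC(1,1).

    ratings: (n_subjects, k_raters) matrix; each subject rated by k raters.
    """
    X = np.asarray(ratings, dtype=float)
    n, k = X.shape
    subject_means = X.mean(axis=1)
    grand_mean = X.mean()
    # Between-subject and within-subject (rater + error) mean squares.
    ms_between = k * np.sum((subject_means - grand_mean) ** 2) / (n - 1)
    ms_within = np.sum((X - subject_means[:, None]) ** 2) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Hypothetical: 4 residents each rated by 3 observers.
ratings = [[8, 7, 8],
           [5, 5, 6],
           [9, 9, 8],
           [4, 5, 4]]
print(round(icc_oneway(ratings), 2))  # share of variance attributable to subjects, not raters
```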

4. How is the validity of an assessment instrument determined?

Validity of assessment instruments requires several sources of evidence to build the case that the instrument measures what it is supposed to measure. Determining validity can be viewed as constructing an evidence-based argument regarding how well a tool measures what it is intended to measure. Evidence can be assembled to support, or not support, a specific use of the assessment tool. Evidence can be found in content, response process, relationships to other variables, and consequences.

Content includes a description of the steps used to develop the instrument. Provide information such as who created the instrument (national experts would confer greater validity than local experts, who in turn would confer more validity than nonexperts) and the other steps taken to ensure the instrument has the appropriate content.

Response process includes information about whether the actions or thoughts of the subjects actually match the test and also information regarding training for the raters/observers, instructions for the test-takers, instructions for scoring, and clarity of these materials.

Relationship to other variables includes correlation of the new assessment instrument results with other performance outcomes that would be expected to be similar. If there is a previously accepted “gold standard” of measurement, correlate the instrument results to the subject's performance on the “gold standard.” In many cases, no “gold standard” exists and comparison is made to other assessments that appear reasonable (eg, in-training examinations, objective structured clinical examinations, rotation “grades,” similar surveys).

Consequences means that, if there are pass/fail or cut-off performance scores, those grouped in each category tend to perform similarly in other settings. Also, if lower performers receive additional training and their scores improve, this would add to the validity of the instrument.

Different types of instruments need an emphasis on different sources of validity evidence. For example, for observer ratings of resident performance, interrater agreement may be key, whereas for a survey measuring resident stress, relationship to other variables may be more important. For a multiple choice examination, content and consequences may be essential sources of validity evidence. For high-stakes assessments (eg, board examinations), substantial evidence to support the case for validity will be required.

There are also other types of validity evidence, which are not discussed here.

5. How can researchers enhance the validity of their assessment instruments?

First, do a literature search and use previously developed outcome measures. If the instrument must be modified for use with your subjects or setting, modify and describe how, in a transparent way. Include sufficient detail to allow readers to understand the potential limitations of this approach.

If no assessment instruments are available, use content experts to create your own and pilot the instrument prior to using it in your study. Test reliability and include as many sources of validity evidence as are possible in your paper. Discuss the limitations of this approach openly.

6. What are the expectations of JGME editors regarding assessment instruments used in graduate medical education research?

JGME editors expect that the validity of your assessment tools will be explicitly discussed in your manuscript, in the methods section. If you are using a previously studied tool in the same setting, with the same subjects, and for the same purpose, citing the reference(s) is sufficient. Additional discussion about your adaptation is needed if you (1) have modified previously studied instruments; (2) are using the instrument for different settings, subjects, or purposes; or (3) are using different interpretation or cut-off points. Discuss whether the changes are likely to affect the reliability or validity of the instrument.

Researchers who create novel assessment instruments need to describe the development process, reliability measures, pilot results, and any other information that may lend credibility to the use of homegrown instruments. Transparency enhances credibility.

In general, little information can be gleaned from single-site studies using untested assessment instruments; these studies are unlikely to be accepted for publication.

Why is validity important in assessment?

Educational assessment should always have a clear purpose, making validity the most important attribute of a good test. The validity of an assessment tool is the extent to which it measures what it was designed to measure, without contamination from other characteristics.

What is the importance of validity and reliability?

The purpose of establishing reliability and validity in research is essentially to ensure that data are sound and replicable and that the results are accurate. Evidence of validity and reliability is a prerequisite for assuring the integrity and quality of a measurement instrument (Kimberlin & Winterstein, 2008).

What is more important: the reliability and validity of a test, or the applicant's perception of the test?

The reliability and validity of a test are more important than an applicant's perception of it, because even if applicants have a bad perception of the test and decide to take legal action, the test's validity and reliability would help defend its use.

Which is more important: validity or reliability?

Validity means that research must measure what it purports to measure; valid research provides truth about the research problem. Reliability means that research must be repeatable, with consistent results. Validity is more important, though, as it is the higher research standard.