Intra-observer reliability is a measure of the extent to which the same observer assigns the same ratings or categories to the same material on repeated occasions.

Test–Retest Reliability

Chong Ho Yu, in Encyclopedia of Social Measurement, 2005

Frequency of Categorical Data

In some tests, stability is the only psychometric attribute that can be estimated, and thus only test–retest reliability can be applied. The Rorschach test, also known as the Rorschach inkblot test, is a good example. It is a projective personality test in which a subject's interpretations of abstract designs are analyzed as a measure of emotional and intellectual functioning and integration. Cronbach pointed out that many Rorschach scores do not have the psychometric characteristics commonly found in most psychological tests. Nonetheless, researchers who employ the Rorschach test can encode the qualitative responses into different categories, and frequency counts of the responses can be tracked. Hence, the stability of these tests can be addressed through studies of test–retest reliability. The following three approaches are widely adopted.

Cohen's Kappa

Cohen's Kappa coefficient, which is commonly used to estimate interrater reliability, can be employed in the context of test–retest. In test–retest, the Kappa coefficient indicates the extent of agreement between frequencies of two sets of data collected on two different occasions.
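As a rough sketch [not part of the original entry], the Kappa computation for test–retest data can be carried out as follows; the responses below are hypothetical, and the calculation follows the standard observed-versus-expected-agreement definition.

from collections import Counter

# Hypothetical "yes"/"no" responses from the same 10 subjects on two occasions.
time1 = ["yes", "no", "yes", "yes", "no", "yes", "no", "no", "yes", "no"]
time2 = ["yes", "no", "yes", "no", "no", "yes", "no", "yes", "yes", "no"]

n = len(time1)
categories = sorted(set(time1) | set(time2))

# Observed agreement: proportion of subjects giving the same response twice.
p_observed = sum(a == b for a, b in zip(time1, time2)) / n

# Expected (chance) agreement implied by the marginal response frequencies.
count1, count2 = Counter(time1), Counter(time2)
p_expected = sum(count1[c] * count2[c] for c in categories) / n ** 2

kappa = (p_observed - p_expected) / (1 - p_expected)
print(round(kappa, 2))  # 0.6 for these hypothetical data

For these invented data, a Kappa of 0.6 would indicate agreement between the two administrations well above the level expected by chance.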

Kendall's Tau

However, Kappa coefficients cannot be estimated when the subject responses are not distributed among all valid response categories. For instance, if subject responses are distributed between “yes” and “no” in the first testing but among “yes,” “no,” and “don't know” in the second testing, then Kappa is considered invalid. In this case, Kendall's Tau, which is a nonparametric measure of association based on the number of concordances and discordances in paired observations, should be used instead. A pair of observations is concordant when the two measures rank its members in the same order and discordant when they rank them in opposite orders.
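As an illustration [not from the original text], Kendall's Tau can be obtained with SciPy's kendalltau function on paired, ordinally coded responses; the scores below are hypothetical.

# Hypothetical ordinal codes for the same 8 subjects at two test administrations.
from scipy.stats import kendalltau

time1 = [1, 2, 2, 3, 1, 3, 2, 1]
time2 = [1, 2, 3, 3, 1, 2, 2, 1]

tau, p_value = kendalltau(time1, time2)  # tau-b, which adjusts for ties
print(round(tau, 2), round(p_value, 3))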

Yule's Q

Even if the subject responses are distributed among all valid categories, researchers are encouraged to go beyond Kappa by computing other categorical correlation coefficients for verification or “triangulation.” Yule's Q is a good candidate because of its conceptual and computational simplicity. It is not surprising that in some studies Cohen's Kappa and Yule's Q yield substantively different values, and such a discrepancy prompts the researcher to investigate further.

Yule's Q is a measure of correlation based on the odds ratio. In Table I, a, b, c, and d represent the frequency counts of two categorical responses, “yes” and “no,” recorded on two occasions, “Time 1” and “Time 2.” The odds ratio, by definition, is OR = ad/bc. Yule's Q is defined as Q = (OR − 1)/(OR + 1).

Table I. 2 × 2 Table of Two Measures

                 Time 2: Yes   Time 2: No
Time 1: Yes          a             b
Time 1: No           c             d
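For illustration [not part of the original entry], the following sketch computes Yule's Q from hypothetical cell counts arranged as in Table I.

# Hypothetical cell counts a, b, c, d from a 2 x 2 table like Table I:
# rows = Time 1 ("yes"/"no"), columns = Time 2 ("yes"/"no").
a, b, c, d = 40, 10, 5, 45

odds_ratio = (a * d) / (b * c)            # OR = ad/bc
q = (odds_ratio - 1) / (odds_ratio + 1)   # Q = (OR - 1)/(OR + 1)
print(round(q, 3))                        # 0.946: strong agreement across occasions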


URL: //www.sciencedirect.com/science/article/pii/B0123693985000943

Diary Studies

Kathy Baxter, ... John Boyd, in Understanding your Users [Second Edition], 2015

Crowd Sourcing

You may have a taxonomy in mind that you want to use for data organization or analysis. For example, if you are collecting feedback from customers about their experience using your travel app, the categories might be the features of your app [e.g., flight search, car rental, account settings]. If so, you can crowd source the analysis among volunteers within your group/company or people outside your company [e.g., Amazon’s Mechanical Turk]. You would ask volunteers to “tag” each data point with one of the labels from your taxonomy [e.g., “flight search,” “website reliability”]. This works well if the domain is easy for the volunteer to understand and the data points can stand alone [i.e., one does not need to read an entire diary to understand what the participant is reporting].

If you do not have any taxonomy in mind, you can create one by conducting an affinity diagram on a random subset of your data. Because this is a subset, it will likely be incomplete, so allow volunteers the option to say “none of the above” and suggest their own label.

If you have the time and resources, it is best to have more than one volunteer categorize each data point so you can measure interrater reliability [IRR] or interrater agreement. IRR is the degree to which two or more observers assign the same rating or label to a behavior. In this case, it is the amount of agreement between volunteers coding the same data points. If you are analyzing nominal data [see next section], IRR between two coders is typically calculated using Cohen’s kappa, which ranges from −1 to 1, with 1 meaning perfect agreement, 0 meaning completely random agreement, and −1 meaning perfect disagreement. Refer to Chapter 9, “Interviews” on page 254 to learn more about Cohen’s kappa. Data points with low IRR should be manually reviewed to resolve the disagreement.
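As a quick, hypothetical sketch of that calculation [not from the book], two coders' labels can be compared with scikit-learn's cohen_kappa_score; the diary tags below are invented.

# Hypothetical tags assigned by two volunteers to the same five diary entries.
from sklearn.metrics import cohen_kappa_score

coder_1 = ["flight search", "car rental", "account settings", "flight search", "car rental"]
coder_2 = ["flight search", "car rental", "flight search", "flight search", "car rental"]

kappa = cohen_kappa_score(coder_1, coder_2)
print(round(kappa, 2))  # 1 = perfect agreement, 0 = chance-level agreement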

Suggested Resources for Additional Reading

There are actually multiple ways to calculate IRR, depending on the type of data and number of raters. To learn more about the types of IRR and how to calculate them, refer to the following handbook:

Gwet, K. L. [2010]. Handbook of inter-rater reliability. Advanced Analytics, LLC, Gaithersburg, MD.


URL: //www.sciencedirect.com/science/article/pii/B9780128002322000080

Behavior Assessment System for Children [BASC]: Toward Accurate Diagnosis and Effective Treatment

GAIL S. MATAZOW, R.W. KAMPHAUS, in Handbook of Psychoeducational Assessment, 2001

Reliability of the BASC

The BASC manual [Reynolds & Kamphaus, 1992] provides three types of reliability evidence: internal consistency, test-retest reliability, and interrater reliability. The TRS reliability evidence, as noted in the manual, is as follows: internal consistencies of the scales averaged above .80 for all three age levels; test-retest correlations had median values of .89, .91, and .82, respectively, for the scales at the three age levels; and interrater reliability correlations revealed values of .83 for four pairs of teachers and .63 and .71 for mixed groups of teachers. The PRS reliability evidence, as noted in the manual, is as follows: internal consistencies of the scales were in the middle .80s to low .90s at all three age levels; test-retest correlations had median values of .85, .88, and .70, respectively, for the scales at the three age levels; and interparent reliability correlations revealed values of .46, .57, and .67 at the three age levels. The SRP reliability evidence, as presented in the manual, is as follows: internal consistencies averaged about .80 for each gender at both age levels and test-retest correlations had median values of .76 at each age level. The manual provides documentation of correlations with other instruments of behavioral functioning.


URL: //www.sciencedirect.com/science/article/pii/B9780120585700500112

Clinical Psychology: Validity of Judgment

H.N. Garb, in International Encyclopedia of the Social & Behavioral Sciences, 2001

4 Prediction of Behavior

With regard to predicting behavior, mental health professionals have been able to make reliable and moderately valid judgments. For the prediction of suicide and the prediction of violence, inter-rater reliability has ranged from fair to excellent. However, validity has been poor for the prediction of suicidal behavior [suicidal behavior refers to suicide gestures, suicide attempts, and suicide completions]. For example, in one study [Janofsky et al. 1988], mental health professionals on an acute psychiatry in-patient unit interviewed patients and then rated the likelihood of suicidal behavior in the next seven days. They were not able to predict at a level better than chance. In contrast to the prediction of suicidal behavior, predictions of violence have been more accurate than chance. In fact, though the long-term prediction of violence is commonly believed to be a more difficult task than the short-term prediction of violence, both short- and long-term predictions of violence have been moderately valid. Finally, the validity of clinicians' prognostic ratings has rarely been studied, but there is reason to believe that clinicians' prognostic ratings for patients with schizophrenia may be too pessimistic [patients with schizophrenia may do better than many clinicians expect].


URL: //www.sciencedirect.com/science/article/pii/B0080430767012973

Performance Evaluation in Work Settings

W.C. Borman, in International Encyclopedia of the Social & Behavioral Sciences, 2001

3.1.1 Evaluation of ratings

Ratings of job performance sometimes suffer from psychometric errors such as distributional errors or illusory halo. Distributional errors include leniency/severity where raters evaluate ratees either too high or too low in comparison to their actual performance level. Restriction-in-range is another distributional error. With this error, a rater may rate two or more ratees on a dimension such that the spread [i.e., variance] of these ratings is lower than the variance of the actual performance levels for these ratees [Murphy and Cleveland 1995]. Illusory halo occurs when a rater makes ratings on two or more dimensions such that the correlations between the dimensions are higher than the between-dimension correlations of the actual behaviors relevant to those dimensions [Cooper 1981].
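As a loose illustration of these definitions [not from the original article], the sketch below uses invented numbers to show a restriction-in-range pattern [ratings less variable than true performance] and a halo-like pattern [rated dimensions more strongly correlated than the underlying behaviors].

import numpy as np

# Hypothetical "true" performance levels and one rater's ratings of five ratees.
true_scores = np.array([2.0, 3.5, 5.0, 6.5, 8.0])
ratings = np.array([4.5, 5.0, 5.5, 5.5, 6.0])

# Restriction-in-range: the ratings are far less spread out than true performance.
print(ratings.var(), true_scores.var())  # 0.26 vs. 4.5

# Illusory halo: two rated dimensions correlate more strongly than the
# corresponding true behaviors do.
true_dim1 = np.array([3, 7, 4, 8, 5])
true_dim2 = np.array([6, 2, 7, 3, 5])
rated_dim1 = np.array([4, 7, 5, 8, 5])
rated_dim2 = np.array([5, 6, 6, 8, 5])
print(np.corrcoef(true_dim1, true_dim2)[0, 1])    # strongly negative
print(np.corrcoef(rated_dim1, rated_dim2)[0, 1])  # strongly positive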

A second common approach for evaluating ratings is to assess interrater reliability, either within rating source [e.g., peers] or across sources [e.g., peers and supervisor]. The notion here is that high interrater agreement implies that the ratings are accurate. Unfortunately, this does not necessarily follow. Raters may agree in their evaluations because they are rating according to ratee reputation or likeability, even though these factors might have nothing to do with actual performance. In addition, low agreement between raters at different organizational levels may result from these raters' viewing different samplings of ratee behavior or having different roles related to the ratees. In this scenario, each level's raters might be providing valid evaluations, but for different elements of job performance [Borman 1997]. Accordingly, assessing the quality of ratings using interrater reliability is somewhat problematic. On balance, however, high interrater reliability is desirable, especially within rating source.

A third approach sometimes suggested for evaluating ratings is to assess their validity or accuracy. The argument made is that rating errors and interrater reliability are indirect ways of estimating what we really want to know: how accurate are the ratings at reflecting actual ratee performance? Unfortunately, evaluating rating accuracy requires comparing the ratings to some kind of ‘true score,’ a characterization of each ratee's actual performance. Because determining true performance scores in a work setting is typically impossible, research on the accuracy of performance ratings has proceeded in the laboratory. To evaluate accuracy, written or videotaped vignettes of hypothetical ratees have been developed, and target performance scores on multiple dimensions have been derived using expert judgment. Ratings of these written-vignette or videotaped performers can then be compared to the target scores to derive accuracy scores [Borman 1977].
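As a loose, hypothetical illustration [not Borman's actual scoring procedure], accuracy for one rater might be summarized by comparing that rater's dimension ratings to the expert-derived target scores:

import numpy as np

# Hypothetical expert-derived target scores and one rater's judgments
# for five videotaped performers on a single dimension.
target_scores = np.array([6.0, 4.5, 7.0, 3.0, 5.5])
rater_scores = np.array([5.5, 5.0, 6.5, 4.0, 5.0])

# Two simple accuracy summaries: correlation with the targets and the
# mean absolute deviation from them (smaller deviation = more accurate).
correlation = np.corrcoef(rater_scores, target_scores)[0, 1]
mean_abs_dev = np.abs(rater_scores - target_scores).mean()
print(round(correlation, 2), round(mean_abs_dev, 2))  # 0.96 0.6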


URL: //www.sciencedirect.com/science/article/pii/B008043076701398X

Inter-Rater Reliability

Robert F. DeVellis, in Encyclopedia of Social Measurement, 2005

Percentage Agreement

One of the simplest ways of judging reliability is percentage agreement [i.e., the percentage of cases that raters classify similarly]. There are at least two ways to do this, and the difference between them is important. The simplest method is to take a simple count of the number of instances of a category that each rater reports. Assume, as in Table I, that each of two raters found 21 instances of verbs among 100 possibilities. Because they agree on the number of instances, 21 in 100, it might appear that they completely agree on the verb score and that the inter-rater reliability is 1.0. This conclusion may be unwarranted, however, because it does not determine whether the 21 events rater #1 judged to be verbs were the same events that rater #2 assigned to that category. The raters have agreed on the rate of occurrence [i.e., 21%] of the category “verbs,” but not necessarily on specific examples of its occurrence. Rater #1 might have rated the first 21 of 100 words of a transcript as verbs, whereas rater #2 might have described the last 21 words as verbs. In such a case, the raters would not have agreed on the categorization of any of the specific words on the transcript. Rather than simply tallying the number of words each rater assigned to a category, the researcher could determine how the raters evaluated the same words. The cross-tabulation shown in Table I accomplishes this by indicating how many words rater #1 classified as verbs, for example, and how many of those same words rater #2 classified similarly. The number of word-by-word agreements for the several categories appears along the main diagonal of the table. This is a more specific indicator of percentage agreement as a measure of inter-rater reliability. In this case, the overall percentage agreement is 73% and, for verbs, it is 10/21 = 47.6%.
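To make the distinction concrete [this sketch is not part of the original entry], the following Python snippet contrasts the two senses of agreement using invented word codes.

# Hypothetical part-of-speech codes assigned by two raters to the same eight words.
rater_1 = ["verb", "noun", "verb", "adj", "noun", "verb", "noun", "adj"]
rater_2 = ["noun", "noun", "verb", "adj", "verb", "verb", "noun", "noun"]

# Marginal agreement: both raters report three verbs, which says nothing about
# whether they labeled the *same* words as verbs.
print(rater_1.count("verb"), rater_2.count("verb"))  # 3 3

# Word-by-word agreement: proportion of words given the same label, i.e., the
# diagonal of the cross-tabulation divided by the total number of words.
matches = sum(a == b for a, b in zip(rater_1, rater_2))
print(matches / len(rater_1))  # 0.625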


URL: //www.sciencedirect.com/science/article/pii/B0123693985000955

Interviews

Kathy Baxter, ... Lana Yarosh, in Understanding your Users [Second Edition], 2015

Qualitative Content/Thematic Analysis

Content analysis is a method of sorting, synthesizing, and organizing unstructured textual data from an interview or other source [e.g., survey, screen scrape of an online forum]. Content analysis may be done manually [by hand or software-assisted] or using software alone. When done manually, researchers must read and reread interviewee responses, generate a scheme to categorize the answers, develop rules for assigning responses to the categories, and ensure that responses are reliably assigned to categories. Because these rules are imperfect and interpretation of interviewee responses is difficult, researchers may categorize answers differently. Therefore, to ensure that answers are reliably assigned to categories, two or more researchers must code each interview, and a predetermined rate of interrater reliability [e.g., Cohen’s kappa] must be achieved [see How to Calculate Interrater Reliability [Kappa] for details].

Several tools are available for purchase to help you analyze qualitative data. They range in purpose from quantitative, focusing on counting the frequency of specific words or content, to helping you look for patterns or trends in your data. If you would like to learn more about the available tools, what they do, and where to find them, refer to Chapter 8, “Qualitative Analysis Tools” section on page 208.

How to Calculate Interrater Reliability [Kappa]

To determine reliability, you need a measure of interrater reliability [IRR] or interrater agreement. Interrater reliability is the degree to which two or more observers assign the same rating, label, or category to an observation, behavior, or segment of text. In this case, we are interested in the amount of agreement or reliability between volunteers coding the same data points. High reliability indicates that individual biases were minimized and that another researcher using the same categories would likely come to the same conclusion that a segment of text would fit within the specified category.

The simplest form of interrater agreement or reliability is agreements / (agreements + disagreements). However, this measure of simple reliability does not take into account agreements due to chance. For example, if we had four categories, we would expect volunteers to agree 25% of the time simply due to chance. To account for agreements due to chance in our reliability analysis, we must use a variation. One common variation is Krippendorff’s Alpha [KALPHA], which does not suffer from some of the limitations of other methods [i.e., it can handle small samples, more than two coders, and missing data]. De Swert [2012] created a step-by-step manual for how to calculate KALPHA in SPSS. Another very common alternative is Cohen’s kappa, which should be used when you are analyzing nominal data and have two coders/volunteers.

To measure Cohen’s kappa, follow these steps:

Step 1. Organize data into a contingency table. Agreements between coders will increment the diagonal cells, and disagreements will increment the off-diagonal cells. For example, since coder 1 and coder 2 agree that observation 1 goes in category A, tally one agreement in the cell A, A. The disagreement in observation 3, on the other hand, goes in cell A, B.

Raw data [not reproduced here]: a list of 30 observations, each showing the category assigned by coder 1 and by coder 2. For example, observation 1 was coded A by both coders, while observation 3 was coded A by coder 1 and B by coder 2.

Step 2. Compute the row, column, and overall sums for the table. The sum of the row totals, the sum of the column totals, and the overall total should all be the same number.

                  Coder 2
              A     B     C   Total
Coder 1   A   6     2     1       9
          B   1     9     1      11
          C   1     1     8      10
  Total       8    12    10      30

Step 3. Compute the total number of actual agreements by summing the diagonal cells:

∑ agreement = 6 + 9 + 8 = 23

Step 4. Compute the expected agreement for each category. For example, to compute the expected agreement for “A,” “A”:

expected agreement = (row total × column total) / overall total = (9 × 8) / 30 = 2.4

Step 5. Sum the expected agreements due to chance:

∑ expected agreement = 2.4 + 4.4 + 3.3 = 10.1

Step 6. Compute Cohen’s kappa:

Kappa = (∑ agreement − ∑ expected agreement) / (total − ∑ expected agreement) = (23 − 10.1) / (30 − 10.1) ≈ 0.65

Step 7. Compare Cohen’s kappa to benchmarks from Landis and Koch [1977]: values below 0 indicate poor agreement, 0.00–0.20 slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, and 0.81–1.00 almost perfect agreement. The kappa of 0.65 in this example therefore reflects substantial agreement.
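As a cross-check [not part of the original chapter], the whole calculation can be reproduced in a short Python sketch from the contingency table above; note that carrying full precision gives an expected agreement of about 10.13 rather than the rounded 10.1.

# Contingency table from Step 2: rows = coder 1, columns = coder 2.
table = [
    [6, 2, 1],  # coder 1 chose A
    [1, 9, 1],  # coder 1 chose B
    [1, 1, 8],  # coder 1 chose C
]

total = sum(sum(row) for row in table)                  # 30
row_totals = [sum(row) for row in table]                # [9, 11, 10]
col_totals = [sum(col) for col in zip(*table)]          # [8, 12, 10]

agreement = sum(table[i][i] for i in range(len(table)))  # 6 + 9 + 8 = 23
expected = sum(r * c / total for r, c in zip(row_totals, col_totals))  # about 10.13

kappa = (agreement - expected) / (total - expected)
print(round(kappa, 2))  # about 0.65: "substantial" agreement per Landis and Koch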
