Intra-observer reliability is a measure of the extent to which measurements taken by the same observer are consistent from one occasion to the next.

Test–Retest Reliability

Chong Ho Yu, in Encyclopedia of Social Measurement, 2005

Frequency of Categorical Data

In some tests stability is the only psychometric attribute that can be estimated and thus only test–retest can be applied. The Rorschach test, also known as the Rorschach inkblot test, is a good example. The test is a psychological projective test of personality in which a subject's interpretations of abstract designs are analyzed as a measure of emotional and intellectual functioning and integration. Cronbach pointed out that many Rorschach scores do not have psychometric characteristics that are commonly found in most psychological tests. Nonetheless, researchers who employ the Rorschach test can encode the qualitative-type responses into different categories and frequency counts of the responses can be tracked. Hence, the stability of these tests can be addressed through studies of test–retest reliability. The following three approaches are widely adopted.

Cohen's Kappa

Cohen's Kappa coefficient, which is commonly used to estimate interrater reliability, can be employed in the context of test–retest. In test–retest, the Kappa coefficient indicates the extent of agreement between frequencies of two sets of data collected on two different occasions.
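As a brief illustration (not part of the original chapter), the coefficient can be computed by treating the two testing occasions as the two “raters.” The following minimal Python sketch assumes scikit-learn is available and uses hypothetical responses:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical categorical responses from the same subjects on two occasions.
time1 = ["yes", "no", "yes", "no", "yes", "yes", "no", "yes"]
time2 = ["yes", "no", "yes", "yes", "yes", "no", "no", "yes"]

# Kappa reflects agreement between the two occasions beyond chance agreement.
print(cohen_kappa_score(time1, time2))
```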

Kendall's Tau

However, Kappa coefficients cannot be estimated when the subject responses are not distributed among all valid response categories. For instance, if subject responses are distributed between “yes” and “no” in the first testing but among “yes,” “no,” and “don't know” in the second testing, then Kappa is considered invalid. In this case, Kendall's Tau, which is a nonparametric measure of association based on the number of concordances and discordances in paired observations, should be used instead. A pair of observations is concordant when the two measurements order its members in the same way and discordant when they order them in opposite ways.
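A minimal Python sketch, assuming the response categories can be placed on an ordinal scale (here “no” < “don't know” < “yes”) and that SciPy is available; the responses are hypothetical:

```python
from scipy.stats import kendalltau

# Assumed ordering of the categories for the purpose of ranking.
order = {"no": 0, "don't know": 1, "yes": 2}
time1 = ["yes", "no", "yes", "no", "yes", "no", "yes", "no"]
time2 = ["yes", "no", "don't know", "no", "yes", "don't know", "yes", "no"]

# Tau is computed from the numbers of concordant and discordant pairs.
tau, p_value = kendalltau([order[r] for r in time1],
                          [order[r] for r in time2])
print(tau, p_value)
```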

Yule's Q

Even if the subject responses distribute among all valid categories, researchers are encouraged to go beyond Kappa by computing other categorical-based correlation coefficients for verification or “triangulation.” Yule's Q is a good candidate for its conceptual and computational simplicity. It is not surprising that in some studies Cohen's Kappa and Yule's Q yield substantively different values and the discrepancy drives the researcher to conduct further investigation.

Yule's Q is a measure of correlation based upon the odds ratio. In Table I, a, b, c, and d represent the frequency counts of two categorical responses, “yes” and “no,” recorded on two occasions, “Time 1” and “Time 2.” The odds ratio, by definition, is OR = ad/bc. Yule's Q is defined as Q = (OR − 1)/(OR + 1).

Table I. 2 × 2 Table of Two Measures

                 Time 2: Yes    Time 2: No
Time 1: Yes          a              b
Time 1: No           c              d
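A minimal Python sketch of this calculation; the cell counts are hypothetical:

```python
def yules_q(a, b, c, d):
    # OR = ad / bc; Q = (OR - 1) / (OR + 1)
    odds_ratio = (a * d) / (b * c)
    return (odds_ratio - 1) / (odds_ratio + 1)

# e.g., 40 stable "yes" responses, 10 yes-to-no, 5 no-to-yes, 45 stable "no"
print(yules_q(a=40, b=10, c=5, d=45))  # close to +1, indicating high stability
```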


URL: https://www.sciencedirect.com/science/article/pii/B0123693985000943

Diary Studies

Kathy Baxter, ... John Boyd, in Understanding your Users (Second Edition), 2015

Crowd Sourcing

You may have a taxonomy in mind that you want to use for data organization or analysis. For example, if you are collecting feedback from customers about their experience using your travel app, the categories might be the features of your app (e.g., flight search, car rental, account settings). If so, you can crowd source the analysis among volunteers within your group/company or people outside your company (e.g., Amazon’s Mechanical Turk). You would ask volunteers to “tag” each data point with one of the labels from your taxonomy (e.g., “flight search,” “website reliability”). This works well if the domain is easy for the volunteer to understand and the data points can stand alone (i.e., one does not need to read an entire diary to understand what the participant is reporting).

If you do not have any taxonomy in mind, you can create one by conducting an affinity diagram on a random subset of your data. Because this is a subset, it will likely be incomplete, so allow volunteers the option to say “none of the above” and suggest their own label.

If you have the time and resources, it is best to have more than one volunteer categorize each data point so you can measure interrater reliability (IRR) or interrater agreement. IRR is the degree to which two or more observers assign the same rating or label to a behavior. In this case, it is the amount of agreement between volunteers coding the same data points. If you are analyzing nominal data (see next section), IRR between two coders is typically calculated using Cohen’s kappa and ranges from 1 to − 1, with 1 meaning perfect agreement, 0 meaning completely random agreement, and − 1 meaning perfect disagreement. Refer to Chapter 9, “Interviews” on page 254 to learn more about Cohen’s kappa. Data points with low IRR should be manually reviewed to resolve the disagreement.
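A minimal Python sketch of this check, using hypothetical tags and assuming scikit-learn is available; data points the coders disagreed on are listed for manual review:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels assigned by two volunteers to the same five data points.
coder1 = ["flight search", "car rental", "flight search", "account settings", "car rental"]
coder2 = ["flight search", "car rental", "account settings", "account settings", "car rental"]

print("kappa:", cohen_kappa_score(coder1, coder2))

# Indices of data points the coders disagreed on, to be resolved manually.
disagreements = [i for i, (a, b) in enumerate(zip(coder1, coder2)) if a != b]
print("review:", disagreements)
```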

Suggested Resources for Additional Reading

There are actually multiple ways to calculate IRR, depending on the type of data and number of raters. To learn more about the types of IRR and how to calculate them, refer to the following handbook:

Gwet, K. L. (2010). Handbook of inter-rater reliability. Advanced Analytics, LLC, Gaithersburg, MD.


URL: https://www.sciencedirect.com/science/article/pii/B9780128002322000080

Behavior Assessment System for Children (BASC): Toward Accurate Diagnosis and Effective Treatment

GAIL S. MATAZOW, R.W. KAMPHAUS, in Handbook of Psychoeducational Assessment, 2001

Reliability of the BASC

The BASC manual (Reynolds & Kamphaus, 1992) provides three types of reliability evidence: internal consistency, test-retest reliability, and interrater reliability. The TRS reliability evidence, as noted in the manual, is as follows: internal consistencies of the scales averaged above .80 for all three age levels; test-retest correlations had median values of .89, .91, and .82, respectively, for the scales at the three age levels; and interrater reliability correlations revealed values of .83 for four pairs of teachers and .63 and .71 for mixed groups of teachers. The PRS reliability evidence, as noted in the manual, is as follows: internal consistencies of the scales were in the middle .80s to low .90s at all three age levels; test-retest correlations had median values of .85, .88, and .70, respectively, for the scales at the three age levels; and interparent reliability correlations revealed values of .46, .57, and .67 at the three age levels. The SRP reliability evidence, as presented in the manual, is as follows: internal consistencies averaged about .80 for each gender at both age levels and test-retest correlations had median values of .76 at each age level. The manual provides documentation of correlations with other instruments of behavioral functioning.


URL: https://www.sciencedirect.com/science/article/pii/B9780120585700500112

Clinical Psychology: Validity of Judgment

H.N. Garb, in International Encyclopedia of the Social & Behavioral Sciences, 2001

4 Prediction of Behavior

With regard to predicting behavior, mental health professionals have been able to make reliable and moderately valid judgments. For the prediction of suicide and the prediction of violence, inter-rater reliability has ranged from fair to excellent. However, validity has been poor for the prediction of suicidal behavior (suicidal behavior refers to suicide gestures, suicide attempts, and suicide completions). For example, in one study (Janofsky et al. 1988), mental health professionals on an acute psychiatry in-patient unit interviewed patients and then rated the likelihood of suicidal behavior in the next seven days. They were not able to predict at a level better than chance. In contrast to the prediction of suicidal behavior, predictions of violence have been more accurate than chance. In fact, though the long-term prediction of violence is commonly believed to be a more difficult task than the short-term prediction of violence, both short- and long-term predictions of violence have been moderately valid. Finally, the validity of clinicians' prognostic ratings has rarely been studied, but there is reason to believe that clinicians' prognostic ratings for patients with schizophrenia may be too pessimistic (patients with schizophrenia may do better than many clinicians expect).


URL: https://www.sciencedirect.com/science/article/pii/B0080430767012973

Performance Evaluation in Work Settings

W.C. Borman, in International Encyclopedia of the Social & Behavioral Sciences, 2001

3.1.1 Evaluation of ratings

Ratings of job performance sometimes suffer from psychometric errors such as distributional errors or illusory halo. Distributional errors include leniency/severity where raters evaluate ratees either too high or too low in comparison to their actual performance level. Restriction-in-range is another distributional error. With this error, a rater may rate two or more ratees on a dimension such that the spread (i.e., variance) of these ratings is lower than the variance of the actual performance levels for these ratees (Murphy and Cleveland 1995). Illusory halo occurs when a rater makes ratings on two or more dimensions such that the correlations between the dimensions are higher than the between-dimension correlations of the actual behaviors relevant to those dimensions (Cooper 1981).
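As an illustration (with entirely hypothetical numbers, not from the original article), both errors can be seen by comparing ratings against actual performance: restriction-in-range appears as rating variance below the variance of actual performance, and illusory halo as between-dimension correlations that are inflated relative to those of the actual behaviors.

```python
import numpy as np

# Rows are ratees; columns are two performance dimensions (hypothetical values).
actual  = np.array([[3.0, 4.5], [1.5, 2.0], [4.8, 3.2], [2.2, 4.9]])
ratings = np.array([[3.5, 4.0], [3.0, 3.2], [4.0, 3.8], [3.2, 4.1]])

# Restriction-in-range: the spread of the ratings on a dimension is smaller
# than the spread of actual performance on that dimension.
print(actual[:, 0].var(), ratings[:, 0].var())

# Illusory halo: the ratings correlate more strongly across dimensions than
# the actual behaviors do.
print(np.corrcoef(actual.T)[0, 1], np.corrcoef(ratings.T)[0, 1])
```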

A second common approach for evaluating ratings is to assess interrater reliability, either within rating source (e.g., peers) or across sources (e.g., peers and supervisor). The notion here is that high interrater agreement implies that the ratings are accurate. Unfortunately, this does not necessarily follow. Raters may agree in their evaluations because they are rating according to ratee reputation or likeability, even though these factors might have nothing to do with actual performance. In addition, low agreement between raters at different organizational levels may result from these raters' viewing different samplings of ratee behavior or having different roles related to the ratees. In this scenario, each level's raters might be providing valid evaluations, but for different elements of job performance (Borman 1997). Accordingly, assessing the quality of ratings using interrater reliability is somewhat problematic. On balance, however, high interrater reliability is desirable, especially within rating source.

A third approach sometimes suggested for evaluating ratings is to assess their validity or accuracy. The argument made is that rating errors and interrater reliability are indirect ways of estimating what we really want to know. How accurate are the ratings at reflecting actual ratee performance? Unfortunately, evaluating rating accuracy requires comparing them to some kind of ‘true score,’ a characterization of each ratee's actual performance. Because determining true performance scores in a work setting is typically impossible, research on the accuracy of performance ratings has proceeded in the laboratory. To evaluate accuracy, written or videotaped vignettes of hypothetical ratees have been developed, and target performance scores on multiple dimensions have been derived using expert judgment. Ratings of these written vignette or videotaped performers can then be compared to the target scores to derive accuracy scores (Borman 1977).
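As a deliberately simplified illustration with hypothetical numbers (published accuracy indices, such as Cronbach's accuracy components, are more elaborate), accuracy could be approximated by comparing a rater's scores on the vignettes with the expert-derived target scores:

```python
import numpy as np

target  = np.array([4.0, 2.5, 3.5, 1.5, 5.0])  # expert target scores per vignette
ratings = np.array([3.5, 3.0, 3.5, 2.0, 4.5])  # one rater's scores

print(np.corrcoef(target, ratings)[0, 1])      # correspondence with the targets
print(np.abs(target - ratings).mean())         # mean absolute deviation from them
```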


URL: https://www.sciencedirect.com/science/article/pii/B008043076701398X

Inter-Rater Reliability

Robert F. DeVellis, in Encyclopedia of Social Measurement, 2005

Percentage Agreement

One of the simplest ways of judging reliability is percentage agreement (i.e., the percentage of cases that raters classify similarly). There are at least two ways to do this, and the difference between them is important. The simplest method is to take a simple count of the number of instances of a category that each rater reports. Assume, as in Table I, that each of two raters found 21 instances of verbs among 100 possibilities. Because they agree on the number of instances, 21 in 100, it might appear that they completely agree on the verb score and that the inter-rater reliability is 1.0. This conclusion may be unwarranted, however, because it does not determine whether the 21 events rater #1 judged to be verbs were the same events that rater #2 assigned to that category. The raters have agreed on the rate of occurrence (i.e., 21%) of the category, “verbs” but not necessarily on specific examples of its occurrence. Rater #1 might have rated the first 21 of 100 words of a transcript as verbs, whereas rater #2 might have described the last 21 words as verbs. In such a case, the raters would not have agreed on the categorization of any of the specific words on the transcript. Rather than simply tallying the number of words each rater assigned to a category, the researcher could determine how the raters evaluated the same words. The cross-tabulation shown in Table I accomplishes this by indicating how many words rater #1 classified as verbs, for example, and how many of those same words rater #2 classified similarly. The number of word-by-word agreements for the several categories appear along the main diagonal of the table. This is a more specific indicator of percentage agreement as a measure of inter-rater reliability. In this case, the overall percentage agreement is 73% and, for verbs, it is 10/21 = 47.6%.
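A minimal Python sketch of the word-by-word approach (the words and categories below are hypothetical, not the data behind Table I): cross-tabulate the two raters' classifications and count agreements on the diagonal rather than comparing marginal counts.

```python
from collections import Counter

rater1 = ["verb", "noun", "verb", "adjective", "noun", "verb"]
rater2 = ["verb", "noun", "noun", "adjective", "noun", "adjective"]

crosstab = Counter(zip(rater1, rater2))                        # cell counts
agreements = sum(n for (c1, c2), n in crosstab.items() if c1 == c2)
print("overall agreement:", agreements / len(rater1))

# Category-specific agreement for "verb": diagonal cell over rater 1's verb count.
print("verb agreement:", crosstab[("verb", "verb")] / rater1.count("verb"))
```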


URL: https://www.sciencedirect.com/science/article/pii/B0123693985000955

Interviews

Kathy Baxter, ... Lana Yarosh, in Understanding your Users (Second Edition), 2015

Qualitative Content/Thematic Analysis

Content analysis is a method of sorting, synthesizing, and organizing unstructured textual data from an interview or other source (e.g., survey, screen scrape of online forum). Content analysis may be done manually (by hand or software-assisted) or using software alone. When done manually (by hand or software-assisted), researchers must read and reread interviewee responses, generate a scheme to categorize the answers, develop rules for assigning responses to the categories, and ensure that responses are reliably assigned to categories. Because these rules are imperfect and interpretation of interviewee responses is difficult, researchers may categorize answers differently. Therefore, to ensure that answers are reliably assigned to categories, two or more researchers must code each interview and a predetermined rate of interrater reliability (e.g., Cohen’s kappa) must be achieved (see How to Calculate Interrater Reliability [Kappa] for details).

Several tools are available for purchase to help you analyze qualitative data. They range in purpose from quantitative, focusing on counting the frequency of specific words or content, to helping you look for patterns or trends in your data. If you would like to learn more about the available tools, what they do, and where to find them, refer to Chapter 8, “Qualitative Analysis Tools” section on page 208.

How to Calculate Interrater Reliability (Kappa)

To determine reliability, you need a measure of interrater reliability (IRR) or interrater agreement. Interrater reliability is the degree to which two or more observers assign the same rating, label, or category to an observation, behavior, or segment of text. In this case, we are interested in the amount of agreement or reliability between volunteers coding the same data points. High reliability indicates that individual biases were minimized and that another researcher using the same categories would likely come to the same conclusion that a segment of text would fit within the specified category.

The simplest form of interrater agreement or reliability is agreements / (agreements + disagreements). However, this measure of simple reliability does not take into account agreements due to chance. For example, if we had four categories, we would expect volunteers to agree 25% of the time simply due to chance. To account for agreements due to chance in our reliability analysis, we must use a variation. One common variation is Krippendorff’s Alpha (KALPHA), which does not suffer from some of the limitations of other methods (i.e., sample size, more than two coders, missing data). De Swert (2012) created a step-by-step manual for how to calculate KALPHA in SPSS. Another very common alternative is Cohen’s kappa, which should be used when you are analyzing nominal data and have two coders/volunteers.

To measure Cohen’s kappa, follow these steps:

Step 1. Organize data into a contingency table. Agreements between coders will increment the diagonal cells and disagreements will increment other cells. For example, since coder 1 and coder 2 agree that observation 1 goes in category A, tally one agreement in the cell A, A. The disagreement in observation 3, on the other hand, goes in cell A, B.

Raw data:

[Raw data table not reproduced: a list of the 30 observations with the category (A, B, or C) assigned by each coder; for example, both coders assigned observation 1 to category A, whereas observation 3 was assigned to A by coder 1 and to B by coder 2.]

Step 2. Compute the row totals, column totals, and overall total for the table. The sum of the row totals, the sum of the column totals, and the overall total should all be the same number.

              Coder 2
              A     B     C     Row total
Coder 1   A   6     2     1         9
          B   1     9     1        11
          C   1     1     8        10
Column total  8    12    10        30

Step 3. Compute the total number of actual agreements by summing the diagonal cells:

∑ agreement = 6 + 9 + 8 = 23

Step 4. Compute the expected agreement for each category. For example, to compute the expected agreement for “A,” “A”:

expected agreement for “A,” “A” = (row total for A × column total for A) / overall total

Step 5. Sum the expected agreements due to chance:

∑ expected agreement = 2.4 + 4.4 + 3.3 = 10.1

Step 6. Compute Cohen’s kappa:

Kappa = (∑ agreement − ∑ expected agreement) / (total − ∑ expected agreement)

Step 7. Compare Cohen’s kappa to benchmarks from Landis and Koch (1977):

Kappa        Interpretation
< 0.00       Poor
0.00-0.20    Slight
0.21-0.40    Fair
0.41-0.60    Moderate
0.61-0.80    Substantial
0.81-1.00    Almost perfect

In our example, Cohen’s kappa is (23 − 10.1) / (30 − 10.1) ≈ 0.65, which falls within the “substantial” benchmark category. Generally, Cohen’s kappas of greater than 0.7 are considered acceptable.

Step 8. Resolve disagreements among coders by having coders discuss the reasons they disagreed and come to an agreement.
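A minimal Python sketch reproducing Steps 2-6 for the contingency table above (only NumPy is assumed):

```python
import numpy as np

table = np.array([[6, 2, 1],
                  [1, 9, 1],
                  [1, 1, 8]])   # rows: coder 1; columns: coder 2

total = table.sum()                                    # Step 2: overall total
agreement = np.trace(table)                            # Step 3: sum of the diagonal
expected = (table.sum(axis=1) * table.sum(axis=0) / total).sum()  # Steps 4-5
kappa = (agreement - expected) / (total - expected)    # Step 6
print(round(kappa, 2))
```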


URL: https://www.sciencedirect.com/science/article/pii/B9780128002322000092

Split-Half Reliability

Robert L. Johnson, James Penny, in Encyclopedia of Social Measurement, 2005

Applications

To this point in the article, the discussion of the split-half method has focused on its application in the estimation of reliability based on a set of items administered as a single test form. Such a reliability estimate is completed to determine if results would be similar if students took a test composed of similar items. Cronbach indicated that the purpose is to use the two halves of a test to predict the correlation between two equivalent whole tests. Thus, the method focuses on the consistency of performance across parallel sets of items.

An extension of the split-half method is seen in the estimation of interrater reliability. The scoring of constructed-response items, such as essays or portfolios, generally is completed by two raters. The correlation of one rater's scores with another rater's scores estimates the reliability of scores based on a single rater. However, the reported score is often the average of the two scores assigned by the raters. If raters and observers are conceptualized as forms of a test (e.g., Feldt and Brennan), then pooling ratings is analogous to lengthening a test. Thus, the application of the Spearman–Brown prophecy formula to estimate the reliability of scores based on two raters is appropriate. As equivalence is assumed in the application of the split-half reliability estimation, the use of the Spearman–Brown prophecy formula to estimate the reliability of scores based on two raters requires that the raters are equally qualified and produce ratings of similar means and variances. The addition of less qualified raters can weaken the reliability of the ratings.
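A minimal Python sketch of the step-up calculation; the single-rater correlation of 0.60 is hypothetical:

```python
def spearman_brown(single_rater_r, k=2):
    # Reliability of the average of k parallel raters (or test parts).
    return (k * single_rater_r) / (1 + (k - 1) * single_rater_r)

print(spearman_brown(0.60, k=2))  # 0.75: reliability of the two-rater average
```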


URL: https://www.sciencedirect.com/science/article/pii/B0123693985000967

Measuring Emotions

Jason Matthew Harley, in Emotions, Technology, Design, and Learning, 2016

Facial Expressions

Facial expressions are configurations of different micromotor (small muscle) movements in the face that are used to infer a person’s discrete emotional state (e.g., happiness, anger). Ekman and Friesen’s facial action coding system (FACS) was the first widely used and empirically validated approach to classifying a person’s emotional state from their facial expressions (Ekman, 1992; Ekman & Friesen, 1978).

An example of facial expressions being used by human coders to classify students’ nonbasic emotions comes from Craig, D’Mello, Witherspoon, and Graesser (2008) and D’Mello and Graesser (2010), who had two trained coders classify participants’ emotional states while they viewed videos of participants interacting with AutoTutor, a CBLE designed to foster students’ comprehension of physics and computer literacy. They developed their coding scheme by reducing the set of action units from FACS (used to code facial expressions) to those that they judged relevant to classifying learner-centered emotions, such as boredom, confusion, and frustration. The interrater reliability of trained coders using this coding framework is good for time points selected by the coders for emotions that occurred with sufficient frequency (overall κ = 0.49; boredom, κ = 0.44; confusion, κ = 0.59; delight, κ = 0.58; frustration, κ = 0.37; neutral, κ = 0.31; D’Mello & Graesser, 2010). Judges’ interrater reliability scores were, however, much lower for evaluations of emotions at preselected, fixed points (for emotions that occurred with sufficient frequency; overall κ = 0.31; boredom, κ = 0.25; confusion, κ = 0.36; flow, κ = 0.30; frustration, κ = 0.27; D’Mello & Graesser, 2010). Although less than ideal, these kappa values are nonetheless common (Baker, D’Mello, Rodrigo, & Graesser, 2010; Porayska-Pomsta et al., 2013) and point to the difficulty of classifying participants’ learning-centered emotions, especially at preselected intervals where little affective information is available. For this reason, most emotional coding systems that use facial expressions classify either only facial features (e.g., eyebrow movement) or basic emotions (Calvo & D’Mello, 2010; Zeng et al., 2009), or combinations of facial features and vocalizations.

One of the relatively new and promising trends in using facial expressions to classify learners’ emotions is the development and use of software programs that automate the process of coding using advanced machine learning technologies (Grafsgaard, Wiggins, Boyer, Wiebe, & Lester, 2014; Harley, Bouchet, & Azevedo, 2013; Harley, Bouchet, Hussain, Azevedo, & Calvo, 2015). For example, FaceReader (5.0) is a commercially available facial recognition program that uses an active appearance model to model participant faces and identifies their facial expressions. The program further utilizes an artificial neural network, with seven outputs to classify learners’ emotional states according to six basic emotions, in addition to “neutral.” Harley et al. (2012, 2013, 2015) have conducted research with FaceReader to: (1) examine students’ emotions at different points in time over the learning session with MetaTutor (Azevedo et al., 2012; Azevedo et al., 2013); (2) investigate the occurrence of co-occurring or “mixed” emotional states; and (3) examine the degree of correspondence between facial expressions, skin conductance (i.e., electrodermal activity), and self-reports of emotional states, while learning with MetaTutor by aligning and comparing these methods.

Although automated facial recognition programs are able to analyze facial expressions much faster than human coders, they are not yet as accurate (Calvo & D’Mello, 2010; Terzis, Moridis, & Economides, 2010; Zeng et al., 2009). The accuracy of automatic facial expression programs varies, both by individual emotion and software program (similar to variance between studies of human coders). An important issue pertaining to automatic facial expression recognition software, especially commercial software, is its continuous evolution, including larger training databases and the inclusion of more naturalistic (non-posed or experimentally induced) emotion data (Zeng et al., 2009).

Less sophisticated, partial facial expression recognition programs are also used in research with CBLEs, such as the Blue Eyes camera system, which is able to detect specific facial features and motions (Arroyo et al., 2009; Burleson, 2011; Kapoor, Burleson, & Picard, 2007). These programs are used differently from fully developed automated or human facial coding programs because they do not provide discrete emotional labels. Instead, the facial features they provide are combined with other physiological (e.g., electrodermal activity, EDA) and behavioral data (e.g., posture) to create sets of predictive features (e.g., spike of arousal, leaning forward, eyebrows raised) that are correlated with other measures of emotions, such as self-report instruments, to validate their connection to different emotions or emotional dimensions. Studies that investigate different emotion detection methods and the conclusions we can draw from them regarding the value of individual methods are discussed later.

In summary, facial expressions have numerous advantages as a method for measuring emotional states. For one, they are the most traditional and remain one of the best measures of emotional states in terms of their widespread use and reliability (tested with multiple raters and against other measures of emotion), which is unmatched by most more newly developed methods (see Calvo & D’Mello, 2010; Zeng et al., 2009). Furthermore, facial expressions can be analyzed in real-time using software programs, such as FaceReader and the Computer Expression Recognition Toolbox (CERT; Grafsgaard et al., 2014) or after the experimental session concludes, using human coders (Craig et al., 2008). Options to detect emotions in real-time, such as automatic facial recognition software, also make facial expressions a viable channel to provide information to the CBLE about the learners’ emotional state, which can in turn be used to provide emotionally supportive prompts (these environments are henceforth referred to as emotionally supportive CBLEs). Finally, and like most of the methods discussed in this chapter, facial expression recognition measures are online measures of emotion that capture the expression of an emotion as it occurs and therefore mitigate the shortcomings of offline self-report measures (discussed in detail below).

The disadvantages of using facial expressions to measure emotions are that most facial expression coding schemes rely on the FACS system traditionally used to classify only the six basic emotions, and are very labor-intensive if done by trained human coders rather than software (Calvo & D’Mello, 2010). Programs of research that use facial expressions to examine nonbasic emotions (e.g., D’Mello & Graesser, 2013; Grafsgaard et al., 2014; Rodrigo & Baker, 2011) require extensive cross-method validations to connect configurations of facial expressions with new emotional labels (e.g., engagement, frustration, boredom). Ultimately, facial expression research is currently best suited to examining basic emotions and most efficiently done when using automatic facial recognition programs, which are continuing to improve and approach levels of classification accuracy similar to human coders (Calvo & D’Mello, 2010; Zeng et al., 2009).


URL: https://www.sciencedirect.com/science/article/pii/B9780128018569000050

Observational Studies

Anastasia Kitsantas, ... Panagiota Kitsantas, in Encyclopedia of Social Measurement, 2005

Measurement

Measurement is the means by which replication and comparison become possible in observational studies. Measurement here is complex. The complexities relate to obtaining reliable, accurate, and valid observation. Such observation requires care in developing detailed observation manuals, and in selecting, training, and monitoring observers.

The Observation Process: Training and Monitoring Observers

Observing is a human process. The information about behavior that is essential in the social sciences depends heavily on the observing person. Observational studies require the careful selection of observers and their careful training in using observational forms and in recording and evaluating behavior to acceptable levels of agreement and accuracy. Thus, coding schemes must be clearly defined and articulated. Some researchers call for developing an observational manual (see Hartman) that provides explicit definitions and examples assisting the observers in making accurate evaluations and preventing them from drawing unnecessary inferences. Training observers on videotaped observations, accompanied by use of the observational forms and coding schemes, is a means of achieving desired levels of rater agreement or reliability. A level of interrater reliability of at least 0.70 has been recommended. Accuracy and reliability are enhanced through monitoring observers during actual data collection.

Agreement, Reliability, and Validity

The criteria for accurate, reliable, and valid observation are of critical importance. Generally, these relate to minimizing error in recording behavior. Traditionally, reliability has been defined as the degree to which data are consistent. However, reliability in observational research is greatly influenced by the continuous interaction between the observer and the recording system. This interaction gives rise to two forms of consistency: between-observer and within-observer consistency. In between-observer consistency (or interobserver agreement or interobserver reliability), two or more observers observe the same event. In within-observer consistency (or intraobserver agreement or intraobserver reliability), the same observer observes the behavior of the same subjects under the same conditions. However, there is not universal agreement on the conceptual definitions of these terms.

Methods of computing interobserver agreement include percentage agreement, occurrence (and nonoccurrence) agreement, and the kappa statistic. Overall, these indices describe the degree of agreement between observers. A researcher who obtains a high agreement index may conclude that the observers are interchangeable, and that data from any of the observers are free from measurement error. In regard to intraobserver reliability, problems arise from the fact that identical behaviors and conditions cannot be replicated to estimate the consistency of an observer. Pearson's r could be used to compute interobserver reliability, if the classical parallel test assumptions are met between two or more observers. Those assumptions require that two observers produce the same means and standard deviations. Suen argues that this could be possible if the observers are matched (e.g., gender, education, and training). Finally, the intraclass correlation, which does not require the classical parallel assumptions, can be used as an alternative to the intraobserver reliability index and interobserver agreement indices. The approach uses analysis of variance to estimate the error arising from different sources.
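A minimal Python sketch (with hypothetical ratings) of a one-way intraclass correlation, ICC(1,1), estimated from ANOVA mean squares; other ICC forms exist and may be more appropriate depending on the design:

```python
import numpy as np

ratings = np.array([[4, 5],
                    [2, 2],
                    [5, 4],
                    [3, 3],
                    [1, 2]])            # rows: subjects; columns: observers
n, k = ratings.shape

grand_mean = ratings.mean()
ms_between = k * ((ratings.mean(axis=1) - grand_mean) ** 2).sum() / (n - 1)
ms_within = ((ratings - ratings.mean(axis=1, keepdims=True)) ** 2).sum() / (n * (k - 1))

# ICC(1,1): proportion of variance attributable to subjects rather than error.
icc_1_1 = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
print(round(icc_1_1, 2))
```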

Validity in observational studies requires that the observation procedure measure the behavior it purports to measure. An important aspect of validity is the accuracy with which the observational protocols measure the behavior of interest. Accuracy refers to the extent to which there is correspondence between an observer's scores and the criterion (e.g., scores from an expert observer). Other kinds of validity in addition to accuracy are required in observational research. Generally, validity in research centers on content, criterion, and construct validity. Content validity refers to whether the behaviors to be observed are based on the theory that represents the construct. Criterion validity can be established by determining what behaviors relate to the concept under study. The source of those behaviors is the relevant theory of interest to the researcher. Finally, when a set of behaviors is derived deductively from the theory and postulated to measure a construct, validation of that construct is necessary.

Improvement in the reliability and validity of an observational study can be accomplished with the aid of technologies. Technologies such as audio recorders, video recorders, and computer programs may result in higher reliability and validity by allowing the observer to review (play back) the event many times. This also allows the observer to record several different behaviors that occur close together and makes it practical to have more than one observer, thereby increasing interobserver reliability.


URL: https://www.sciencedirect.com/science/article/pii/B0123693985000104

What is intra observer reliability?

• intra-observer (or within-observer) reliability: the degree to which measurements taken by the same observer are consistent;
• inter-observer (or between-observers) reliability: the degree to which measurements taken by different observers are similar.

What does internal consistency reliability measure?

Internal consistency reliability is a measure of how consistently the items of a test measure the same construct and thus deliver reliable scores. The test-retest method involves administering the same test again after a period of time and comparing the results.

What is intra subject reliability?

As can be easily extrapolated from the previous definitions, intra subject reliability refers to the reproducibility of the identical responses (answers) to a variety of stimuli (items) by a single subject in two or more trials. Of course, this attribute is relevant to relatively permanent convergent skills.

What is an example of inter-observer reliability?

Examiners marking school and university exams are assessed on a regular basis, to ensure that they all adhere to the same standards. This is the most important example of interobserver reliability - it would be extremely unfair to fail an exam because the observer was having a bad day.