Abstract: Interrater reliability (IRR) statistics, like Cohen’s kappa, measure agreement between raters beyond what is expected by chance when classifying items into categories. While Cohen’s kappa has been widely used, it has several limitations, prompting development of Gwet’s agreement statistic, an alternative “kappa” statistic which models chance agreement via an “occasional guessing” model. However, we show that Gwet’s formula for estimating the proportion of agreement due to chance is itself biased for intermediate levels of agreement, despite overcoming limitations of Cohen’s kappa at high and low agreement levels. We derive a maximum likelihood estimator for the occasional guessing model that yields an unbiased estimator of the IRR, which we call the maximum likelihood kappa (κML). The key result is that the chance agreement probability under the occasional guessing model is simply equal to the observed rate of disagreement between raters. The κML statistic provides a theoretically principled approach to quantifying IRR that addresses limitations of previous κ coefficients. Given the widespread use of IRR measures, having an unbiased estimator is important for reliable inference across domains where rater judgments are analyzed.
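The abstract states that, under the occasional guessing model, the chance-agreement probability equals the observed disagreement rate. Below is a minimal Python sketch of what this implies, assuming κML keeps the usual chance-corrected form (p_o − p_c)/(1 − p_c); the paper's exact estimator may differ.

```python
# Sketch of a kappa whose chance-agreement term is the observed disagreement
# rate, as the abstract describes for the occasional guessing model.
# Assumption: the statistic keeps the standard form (p_o - p_c) / (1 - p_c).

def chance_corrected_agreement(p_o, p_c):
    """Generic chance-corrected agreement: (p_o - p_c) / (1 - p_c)."""
    return (p_o - p_c) / (1.0 - p_c)

def kappa_ml(ratings_a, ratings_b):
    """Kappa with chance agreement set to the observed disagreement rate."""
    n = len(ratings_a)
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n  # observed agreement
    p_c = 1.0 - p_o                                              # observed disagreement
    return chance_corrected_agreement(p_o, p_c)                  # = (2 * p_o - 1) / p_o

if __name__ == "__main__":
    rater_a = ["yes", "yes", "no", "no", "yes", "no"]
    rater_b = ["yes", "no", "no", "no", "yes", "yes"]
    print(kappa_ml(rater_a, rater_b))  # p_o = 4/6, so kappa = 0.5
```

Under this reading the two-rater estimator reduces to (2·p_o − 1)/p_o; this is only an illustration of the stated key result, not the paper's derivation.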
Abstract: Considerable research has been conducted on how interrater agreement (IRA) should be established before data can be aggregated from the individual rater level to the organization level in survey research. However, little is known about how researchers should treat observations whose low IRA values fail to meet the suggested standard. We seek to answer this question by investigating the impact of two factors (the relationship strength and the overall IRA level of a sample) on the IRA decision. Using both real data from a service industry and simulated data, we find that both factors affect whether a researcher should include or exclude observations with low IRA values. Based on the results, we offer practical guidelines on when to use the entire sample.
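The abstract does not name a specific IRA index or cutoff, so the sketch below uses the single-item r_wg index with the commonly cited 0.70 threshold purely to illustrate the include/exclude decision it discusses; the index, cutoff, and group data are assumptions, not details from the study.

```python
# Illustration of screening rater groups by an IRA index before aggregation.
# r_wg and the 0.70 cutoff are common conventions assumed here for the example;
# the abstract itself does not specify which index or standard was used.
import statistics

def rwg_single_item(ratings, n_options):
    """Single-item r_wg: 1 minus the observed variance over the variance of a
    uniform (random-response) distribution across n_options scale points."""
    s2 = statistics.variance(ratings)             # observed within-group variance
    sigma2_uniform = (n_options ** 2 - 1) / 12.0  # null (uniform) variance
    return 1.0 - s2 / sigma2_uniform

def split_by_ira(groups, n_options, cutoff=0.70):
    """Partition rater groups into those meeting and those failing the cutoff."""
    keep, drop = {}, {}
    for name, ratings in groups.items():
        (keep if rwg_single_item(ratings, n_options) >= cutoff else drop)[name] = ratings
    return keep, drop

if __name__ == "__main__":
    groups = {"unit_A": [4, 4, 5, 4], "unit_B": [1, 5, 2, 5]}  # 5-point scale
    keep, drop = split_by_ira(groups, n_options=5)
    print(sorted(keep), sorted(drop))  # ['unit_A'] ['unit_B']
```

The study's point is that dropping every group below such a cutoff is not automatically the right call; whether to retain low-IRA observations depends on relationship strength and the sample's overall IRA level.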
Abstract: AIM: To investigate the inter- and intra-rater reliability of the vertebral fracture classifications used in the Swedish fracture register. METHODS: Radiological images of consecutive patients with cervical spine fractures (n = 50) were classified by five raters of different experience levels on two occasions. An identical process was performed for thoracolumbar fractures (n = 50). Cohen's kappa was used to calculate the inter- and intra-rater reliability. RESULTS: The mean kappa coefficient for inter-rater reliability ranged between 0.54 and 0.79 for the cervical fracture classifications, between 0.51 and 0.72 for the thoracolumbar classifications (overall and for the different subclassifications), and between 0.65 and 0.77 for the presence or absence of signs of ankylosing disorder in the fracture area. The mean kappa coefficient for intra-rater reliability ranged between 0.58 and 0.80 for the cervical fracture classifications, between 0.46 and 0.68 for the thoracolumbar fracture classifications (overall and for the different subclassifications), and between 0.79 and 0.81 for the presence or absence of signs of ankylosing disorder in the fracture area. CONCLUSION: The classifications used in the Swedish fracture register for vertebral fractures have acceptable inter- and intra-rater reliability, with a moderate strength of agreement.
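Since the study reports Cohen's kappa for pairs of ratings, a self-contained Python version of that calculation is sketched below; the category labels and ratings are invented for illustration and are not the register's actual fracture classifications.

```python
# Two-rater Cohen's kappa: chance agreement from each rater's marginal
# category proportions. Labels and ratings below are made up for the example.
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    n = len(ratings_a)
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n  # observed agreement
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    categories = set(ratings_a) | set(ratings_b)
    # Expected chance agreement: sum over categories of the product of marginals.
    p_c = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)
    return (p_o - p_c) / (1.0 - p_c)

if __name__ == "__main__":
    rater_1 = ["A0", "A1", "B2", "A0", "C1", "B2", "A1", "A0"]
    rater_2 = ["A0", "A1", "B1", "A0", "C1", "B2", "A0", "A0"]
    print(round(cohens_kappa(rater_1, rater_2), 2))  # 0.66
```

Intra-rater reliability is computed the same way, with the two argument lists being the same rater's classifications from the two occasions.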