Construction of the Grade Response Multilevel Rater Drift Model

Formulation of Grade Response Multilevel DRIFT Model

Abstract: For tests whose scoring takes a long time, the effect of time on rating accuracy cannot be ignored. This study proposes the grade response multilevel DRIFT model for detecting rater drift and evaluates its performance through simulation studies. The results show that the model estimates item and ability parameters accurately, and that the random effect model detects rater drift more effectively than the fixed effect model, with better validity and stability. Its applicability in real settings can be verified further, and it can be extended into a full model that includes predictors.

Recently, both national and international educational assessment programs have highlighted the usefulness of constructed-response items (CR items), so the rater effect deserves attention. The grade response multilevel facets model (GR-MLFM), proposed by Kang, Sun and Zeng (2016), is used to detect the rater effect, and simulations demonstrate that GR-MLFM can both estimate the item and person parameters precisely and detect the rater effect efficiently. GR-MLFM integrates the advantages of the many-facets Rasch model, the multilevel random coefficient model, and the grade response model. Like other multi-facet models, however, GR-MLFM treats the rater effect as static: only the overall rater effect across time stages can be obtained, while the specific rater effect at each time stage is not observed. In fact, when the rating task takes place over several hours or several days, concern may arise about the comparability of the ratings both between and within raters over time (Wolfe, Moulder, & Myford, 2001). This phenomenon is called DRIFT, which stands for differential rater functioning over time. Myford and Wolfe (2009) developed the separate model (SM), based on the many-facets Rasch model with the time variable included as one of the facets, to estimate the specific rater effect at each time stage. With this model, we can obtain the separate severity and the trend in severity for each rater. Nevertheless, SM cannot identify the factors that affect rater drift. Many other studies have aimed to detect the drift effect with generalizability theory or other item response models, but these methods can only detect rater drift; they cannot reveal the factors behind it.

In order to detect rater drift and simultaneously identify the factors that affect it, the authors construct a model for this situation, named the grade response multilevel DRIFT model (GR-MLDM). This model extends GR-MLFM and inherits the concepts of SM, and therefore combines the advantages of both. Two simulation studies, one using the rater fixed effect model and one using the rater random effect model, are conducted to evaluate the reasonableness and feasibility of GR-MLDM. The fixed effect model has four types of parameters (discrimination, difficulty, ability, and rater severity over time) and no interaction effect between raters and time stages, which means that the overall severity of raters remains the same across time. The random effect model allows interactions between raters and time stages, so the dynamic "rater drift" can be detected. The results show that: (1) GR-MLDM, under both the fixed effect model and the random effect model, can estimate the item and person parameters precisely. (2) Compared with the fixed effect model, the random effect model can detect rater drift more precisely; furthermore, because it allows the interaction between raters and time stages, the random effect model appears to be more suitable for large-scale assessments. For further investigation, we will apply GR-MLDM to real situations to verify the applicability of this model. In addition, predictors can be added to the random effect model to form the full model, which can evaluate the factors that affect rater drift in practice.
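The abstract does not give the model equations, so the following is only an illustrative sketch of the general idea, not the authors' GR-MLDM specification. It assumes a standard logistic graded response model and assumes that rater severity at a given time stage enters as an additive shift on every category threshold. The function names, the parameter values, and the drifting severity sequence are all hypothetical.

```python
import math


def grm_category_probs(theta, a, thresholds, severity=0.0):
    """Category probabilities under a logistic graded response model.

    theta: examinee ability; a: item discrimination;
    thresholds: increasing category thresholds b_1 < ... < b_{K-1};
    severity: ASSUMED additive rater severity at one time stage.
    """
    # Cumulative probabilities P(score >= k); the severity term shifts
    # every threshold by the same amount (an assumption of this sketch).
    cum = [1.0]
    for b in thresholds:
        cum.append(1.0 / (1.0 + math.exp(-a * (theta - b - severity))))
    cum.append(0.0)
    # Each category probability is the difference between adjacent
    # cumulative probabilities.
    return [cum[k] - cum[k + 1] for k in range(len(cum) - 1)]


def expected_score(probs):
    """Expected rating for categories scored 0, 1, ..., K-1."""
    return sum(k * p for k, p in enumerate(probs))


# Hypothetical rater whose severity grows over three time stages.
drifting_severity = [0.0, 0.3, 0.6]
scores_over_time = [
    expected_score(grm_category_probs(0.5, 1.2, [-1.0, 0.0, 1.0], s))
    for s in drifting_severity
]
```

Under these assumptions, the same examinee receives a lower expected rating at each later time stage as the rater becomes more severe, which is the kind of within-rater change over time that a drift model is meant to capture.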

Authors: Gu Shiwei, Zeng Pingfei, Sun Xiaojian, Kang Chunhua

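Affiliations: College of Teacher Education, Zhejiang Normal University; Collaborative Innovation Center of Assessment toward Basic Education Quality, Beijing Normal University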

Source: Journal of Psychological Science (《心理科学》), 2018, No. 1, pp. 196-203 (8 pages). Indexed in CSSCI, CSCD, and the Peking University Core Journals list.

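Funding: Supported by the Zhejiang Provincial Natural Science Foundation (LY15C090003) and the Humanities and Social Sciences Research Planning Fund of the Ministry of Education (16YJA190002).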

Keywords: differential rater functioning over time (rater drift), grade response multilevel DRIFT model, fixed effect, random effect

Classification code: B841 [Philosophy and Religion: Basic Psychology]

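References (3)

1. Kang Chunhua, Sun Xiaojian, Zeng Pingfei. A multilevel many-facet rater model based on the graded response model [J]. Journal of Psychological Science, 2016, 39(1): 214-223. Cited by: 5
2. Kang Chunhua, Sun Xiaojian, Zeng Pingfei. The grade response multilevel facets model and its application to the scoring of subjective items [J]. Journal of Psychological Science, 2017, 40(6): 1483-1490. Cited by: 3
3. Sun Xiaojian, Kang Chunhua, Zeng Pingfei, Xin Tao. Factors affecting the accuracy of ability estimation in constructed-response items: the interaction between the number of raters and the number of items [J]. Psychological Exploration, 2018, 38(1): 73-79. Cited by: 1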
