Applying the IRT_Δb Procedure and the Adapted LR Procedure to Detect DIF in Tests with Matrix Sampling

Original title: IRT_Δb法和修正LR法对矩阵取样DIF检验的有效性 (cited by: 2)


Abstract

A matrix-sampled test comprises multiple booklets, so the total score on a single booklet cannot be used directly as the matching variable for DIF detection. Using simulated data, this study first applied both the IRT_Δb method and an LR method adapted to use IRT-estimated examinee ability as the matching variable to detect DIF in a matrix-sampled test, and analyzed the effectiveness of the two methods and the factors that influence it. It then derived classification criteria for the IRT_Δb method from the existing LR criteria for judging DIF, and finally checked them against empirical data. The results show that, in matrix-sampled tests, both the IRT_Δb method and the adapted LR method distinguish items with different amounts of DIF well; sample size, the proportion of DIF items in a booklet, and the true ability difference between examinee groups all have substantial effects on the power, Type I error rate, and classification results of both methods.

Matrix sampling is widely used in large-scale educational assessments. In an assessment with a matrix sampling design, each examinee takes one of several booklets, each containing only a subset of the items. Detecting differential item functioning (DIF) in this setting has attracted considerable attention in recent years, because the observed total score on an individual booklet is not an appropriate matching variable. Traditional detection methods such as Mantel-Haenszel (MH), SIBTEST, and logistic regression (LR) are therefore not suitable. IRT_Δb may be an alternative because it provides a valid matching variable. However, DIF classification criteria for IRT_Δb have not yet been well established. The purposes of this study were therefore: 1) to investigate the efficiency and robustness of using ability parameters obtained from an item response theory (IRT) model as the matching variable, compared with using traditional observed raw total scores; 2) to identify the factors that influence the DIF detection performance of the two methods; and 3) to propose DIF classification criteria for IRT_Δb. Both simulated and empirical data were used to explore the robustness and efficiency of two DIF detection methods: the IRT_Δb method and the adapted LR method, which uses group-level ability estimates from an IRT model as the matching variable.
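The idea of an LR procedure that matches on ability rather than on a booklet raw score can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the true simulated ability stands in for an IRT ability estimate, and the use of the Nagelkerke R² difference between nested models as the effect size is an assumption based on common LR DIF practice.

```python
import numpy as np

def fit_logistic(X, y, n_iter=50):
    """Fit a logistic regression by Newton-Raphson; return (coefficients, log-likelihood)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        step = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (y - p))
        beta += step
        if np.max(np.abs(step)) < 1e-8:
            break
    p = np.clip(1.0 / (1.0 + np.exp(-X @ beta)), 1e-12, 1 - 1e-12)
    return beta, float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

def nagelkerke_r2(loglik_model, loglik_null, n):
    """Nagelkerke pseudo-R2 of a model relative to the intercept-only model."""
    cox_snell = 1.0 - np.exp(2.0 * (loglik_null - loglik_model) / n)
    return cox_snell / (1.0 - np.exp(2.0 * loglik_null / n))

def lr_dif_delta_r2(y, theta, group):
    """R2 gain from adding group and ability-by-group terms to an ability-only model."""
    n = len(y)
    ones = np.ones(n)
    _, ll_null = fit_logistic(ones[:, None], y)
    _, ll_base = fit_logistic(np.column_stack([ones, theta]), y)
    _, ll_full = fit_logistic(np.column_stack([ones, theta, group, theta * group]), y)
    return nagelkerke_r2(ll_full, ll_null, n) - nagelkerke_r2(ll_base, ll_null, n)

def simulate_item(rng, theta, group, b, dif):
    """Rasch responses; the item is harder by `dif` logits for group == 1 (assumed focal group)."""
    p = 1.0 / (1.0 + np.exp(-(theta - (b + dif * group))))
    return (rng.random(len(theta)) < p).astype(float)

# Illustrative data: the ability values are simulated, not estimated from booklets.
rng = np.random.default_rng(0)
theta = rng.normal(size=4000)
group = rng.integers(0, 2, size=4000).astype(float)
print("no DIF :", lr_dif_delta_r2(simulate_item(rng, theta, group, 0.0, 0.0), theta, group))
print("1 logit:", lr_dif_delta_r2(simulate_item(rng, theta, group, 0.0, 1.0), theta, group))
```

Because the matching variable is an ability value on a common scale, the same regression can be run for any item regardless of which booklet an examinee received, which is the property a booklet raw score lacks.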
In the Monte Carlo study, a matrix-sampled test was generated and the following experimental conditions were varied: 1) the proportion of DIF items; 2) the true ability distributions of the examinee groups; 3) the sample size; and 4) the size of DIF. The two DIF detection methods were then applied and their results compared. In addition, power functions were established to derive a DIF classification rule for IRT_Δb from the current rules for LR. In the empirical study, a DIF analysis of the American and Korean mathematics tests from the Programme for International Student Assessment (PISA) 2003 was used to further examine the consistency of the classification rules for IRT_Δb and LR.

The results indicated that, in the matrix sampling design, both the IRT_Δb method and the adapted LR method were sensitive to differences in DIF magnitude. The power, Type I error rate, and final classification of both methods were also influenced by sample size, the percentage of items with DIF, and the ability difference between the focal group and the reference group. In conclusion, both the IRT_Δb method and the adapted LR method can be used to detect DIF in matrix-sampled tests. A classification rule for IRT_Δb was proposed: a cutoff of 0.85 between negligible DIF (A) and intermediate DIF (B), and a cutoff of 1.23 between intermediate DIF (B) and large DIF (C). Researchers are advised to treat this rule as tentative, because ΔR² was confined to a narrow interval and the LR classification rule is very flexible compared with that of MH. Further studies could consider MH, IRT_Δb, and LR simultaneously to produce more comparable and consistent classification rules across methods.
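The proposed cutoffs can be applied mechanically, as in the short sketch below. Only the two cutoff values (0.85 and 1.23) and the A/B/C labels come from the abstract. The function name, the use of the absolute value of Δb, the assumption that Δb is expressed in logits, and the treatment of values exactly on a cutoff are all illustrative assumptions.

```python
def classify_irt_delta_b(delta_b, ab_cut=0.85, bc_cut=1.23):
    """Map a difficulty difference between groups to a DIF category.

    Assumptions (not stated in the abstract): the rule is applied to |delta_b|,
    and a value exactly equal to a cutoff falls into the higher category.
    """
    size = abs(delta_b)
    if size < ab_cut:
        return "A"  # negligible DIF
    if size < bc_cut:
        return "B"  # intermediate DIF
    return "C"      # large DIF

# Hypothetical difficulty differences, for illustration only.
for item, delta_b in {"item_1": 0.20, "item_2": -1.00, "item_3": 1.50}.items():
    print(item, classify_irt_delta_b(delta_b))
```

Keeping the cutoffs as parameters makes it easy to substitute other values, which fits the abstract's advice to treat the rule as tentative.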

Authors: Zhang Xun (张勋), Li Lingyan (李凌艳), Liu Hongyun (刘红云), Sun Yan (孙研)

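Affiliations: State Key Laboratory of Cognitive Neuroscience and Learning, Beijing Normal University; School of Psychology, Beijing Normal University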

Source: Acta Psychologica Sinica (《心理学报》; indexed in CSSCI, CSCD, and the Peking University core journal list), 2013, No. 8, pp. 921-934 (14 pages)

Keywords: matrix sampling test; differential item functioning (DIF); Rasch model; logistic regression

Classification number: B841 [Philosophy and Religion: Basic Psychology]


