Abstract
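Matrix sampling is the most efficient way of collecting data in large-scale educational assessment. Using simulated data, this study examined the accuracy and robustness of plausible values (PV) and of the traditional MLE, WLE and EAP methods in estimating population parameters of student ability under a balanced incomplete block (BIB) matrix sampling design. The results showed that PV gave the most accurate and robust estimates of the population mean and standard deviation; EAP tended to underestimate, MLE and WLE tended to overestimate, and all three were far less accurate and robust than PV. In addition, total sample size had little effect on the estimates, whereas the number of items in each booklet had a larger effect.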
To broaden content coverage while reducing testing time for each student, large-scale educational assessments often use multiple matrix sampling. Because scores are reported to the government and the public, attention centers on population statistics, and reducing error in population estimates becomes important. Researchers therefore use plausible values (PV) to account for the uncertainty about the latent traits. A simulation study was used to compare PV with traditional methods (MLE, WLE and EAP) for group-level estimation (mean and standard deviation) under different matrix sampling designs. The results could provide evidence to support the reporting of student performance in large-scale assessments. In the simulation, student response data were generated for forms of different lengths and samples of different sizes. The independent variables were the number of items in each form and the sample size. The number of items had three levels (8, 16 and 24), and the sample size also had three levels (490, 980 and 4900). There were three control variables: the total number of items (56 dichotomous items), the distribution and range of item difficulty (-3, 3), and the distribution and range of ability (-3, 3) (Wu, 2005). The Balanced Incomplete Block (BIB) design was used as the sampling method, and the Rasch model was employed in the data analysis. EAP, MLE, WLE and five PVs were computed for each student, and the sample means and standard deviations were computed for each of these sets. Two statistical indices, ABS and RMSD, were used to compare the accuracy and robustness of the PV method and the traditional estimation methods. The results indicated that PV was the most accurate and robust method, with estimates close to the true values; even in unfavorable conditions, when the total number of subjects or the number of items in each form was especially low, PV still provided a good estimate.
EAP, MLE and WLE estimated the population mean as well as PV did, but bias appeared when they were used to estimate the population standard deviation. EAP underestimated and both MLE and WLE overestimated the population variance, even when the numbers of subjects and items were largest. Moreover, the bias did not diminish as the sample size increased, but it decreased as the number of items increased, indicating that adding items does more to improve the precision and stability of the estimation methods than adding subjects. The current study considered only the simplest matrix sampling design with the Rasch model. Future studies should take more complex designs into consideration and should also use 2- or 3-parameter models. Furthermore, sample size and number of items are two basic factors influencing population parameter estimation; other factors, such as test length, item difficulty and item type, should be controlled for in further inference.
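The contrast between EAP and PV in estimating the population standard deviation can be illustrated with a small simulation. The sketch below is not the authors' code; it rests on several assumptions that the abstract does not state: abilities are drawn from an untruncated N(0, 1), item difficulties are evenly spaced over (-3, 3) and treated as known, the 56 items are split into 7 blocks of 8, booklets follow a cyclic (7, 3, 1) balanced incomplete block layout (24 items per booklet), and posteriors are computed on a fixed grid. MLE and WLE are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

N_BLOCKS, ITEMS_PER_BLOCK = 7, 8      # 7 x 8 = 56 items in total
N_STUDENTS = 980                      # one of the sample sizes in the study
N_PV = 5                              # five plausible values per student

# Assumed item difficulties: evenly spaced over (-3, 3), shuffled across blocks.
b = np.linspace(-3, 3, N_BLOCKS * ITEMS_PER_BLOCK)
rng.shuffle(b)

# Assumed booklet layout: cyclic (7, 3, 1) BIB design, built from the
# difference set {0, 1, 3} mod 7, so every pair of blocks shares one booklet.
booklets = [[(s + d) % N_BLOCKS for d in (0, 1, 3)] for s in range(N_BLOCKS)]

def booklet_items(blocks):
    """Return the item indices contained in a list of block numbers."""
    return np.concatenate(
        [np.arange(k * ITEMS_PER_BLOCK, (k + 1) * ITEMS_PER_BLOCK) for k in blocks]
    )

# Assumed true abilities and a matching N(0, 1) prior on a grid.
theta = rng.normal(0.0, 1.0, N_STUDENTS)
grid = np.linspace(-6, 6, 241)
prior = np.exp(-0.5 * grid**2)

eap = np.empty(N_STUDENTS)
pv = np.empty((N_STUDENTS, N_PV))

for i in range(N_STUDENTS):
    items = booklet_items(booklets[i % N_BLOCKS])

    # Simulate dichotomous responses under the Rasch model.
    p_true = 1.0 / (1.0 + np.exp(-(theta[i] - b[items])))
    x = rng.random(items.size) < p_true

    # Posterior over the ability grid given this student's responses.
    p_grid = 1.0 / (1.0 + np.exp(-(grid[:, None] - b[items][None, :])))
    loglik = np.where(x, np.log(p_grid), np.log1p(-p_grid)).sum(axis=1)
    post = prior * np.exp(loglik)
    post /= post.sum()

    eap[i] = post @ grid                            # posterior mean (EAP)
    pv[i] = rng.choice(grid, size=N_PV, p=post)     # random posterior draws (PV)

sd_true = theta.std(ddof=1)
sd_eap = eap.std(ddof=1)
sd_pv = pv.std(axis=0, ddof=1).mean()               # average SD over the five PV sets

print(f"true SD {sd_true:.3f} | EAP SD {sd_eap:.3f} | PV SD {sd_pv:.3f}")
```

Because each EAP is a posterior mean, the EAP scores are shrunk toward the prior mean and their spread falls short of the true spread, whereas plausible values are random posterior draws and retain the posterior uncertainty. Changing the tuple `(0, 1, 3)` to `(0,)` gives 8-item booklets, which should make the EAP shortfall larger under these assumptions.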
Source
《心理科学》
CSSCI
CSCD
Peking University Core Journals (北大核心)
2012, No. 5, pp. 1233-1239 (7 pages)
Journal of Psychological Science
Funding
Supported by the Program for New Century Excellent Talents in University of the Ministry of Education of China (NCET-07-0097)
Keywords
large-scale educational assessment, multiple matrix sampling, plausible values