This paper derives the variance of the information content and develops a statistical inference method for it. We describe how information content relates to sensitivity, specificity, efficiency, and prevalence rate. If sensitivity, specificity, and efficiency are fixed, the information content increases as the prevalence rate approaches 0.5. If prevalence rate and efficiency are fixed, the information content increases as sensitivity and specificity approach each other. We compare the power of the information content method with that of the efficiency test, Youden's index test, and the kappa coefficient method. The information content method has higher power than the other methods under most conditions, and it is especially sensitive to the difference between two sensitivities. We conclude that the information content method has more advantages than the other methods considered in this paper.
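The abstract above names several standard measures computed from a 2x2 diagnostic table. The following is a minimal sketch of those comparison measures only (sensitivity, specificity, prevalence rate, efficiency, Youden's index, and Cohen's kappa between test result and true status). The abstract does not give the paper's formula for information content, so it is deliberately not implemented here. The class and function names are hypothetical, and the counts in the usage note are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Counts from a diagnostic 2x2 table (hypothetical container)."""
    tp: int  # diseased, test positive
    fn: int  # diseased, test negative
    fp: int  # healthy, test positive
    tn: int  # healthy, test negative

    @property
    def n(self) -> int:
        return self.tp + self.fn + self.fp + self.tn


def sensitivity(t: TwoByTwo) -> float:
    return t.tp / (t.tp + t.fn)


def specificity(t: TwoByTwo) -> float:
    return t.tn / (t.tn + t.fp)


def prevalence(t: TwoByTwo) -> float:
    return (t.tp + t.fn) / t.n


def efficiency(t: TwoByTwo) -> float:
    # Proportion of subjects classified correctly (overall accuracy).
    return (t.tp + t.tn) / t.n


def youden_index(t: TwoByTwo) -> float:
    return sensitivity(t) + specificity(t) - 1.0


def cohen_kappa(t: TwoByTwo) -> float:
    # Chance-corrected agreement between test result and true status.
    p_observed = efficiency(t)
    p_test_pos = (t.tp + t.fp) / t.n
    p_diseased = prevalence(t)
    p_expected = (p_test_pos * p_diseased
                  + (1.0 - p_test_pos) * (1.0 - p_diseased))
    return (p_observed - p_expected) / (1.0 - p_expected)
```

As a usage example, `TwoByTwo(tp=45, fn=5, fp=10, tn=40)` has sensitivity 0.9, specificity 0.8, and prevalence rate 0.5. Its efficiency equals prevalence × sensitivity + (1 − prevalence) × specificity, which is the standard identity linking the four quantities the abstract relates to information content. The sketch assumes every row and column of the table is non-empty; it does not guard against division by zero.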
Boundary recognition is an important research topic in natural language processing and provides a basis for applications such as Chinese word segmentation, chunk analysis, and named entity recognition. To address ambiguity in the boundary recognition of Chinese punctuation marks, this paper proposes grammar testing methods for recognizing the boundaries of slight-pause marks and then calculates the annotation consistency achieved with these methods. The statistical results show that grammar testing methods can greatly improve the annotation consistency of slight-pause mark boundary recognition. Consistency in the second annotation round is 0.0303 higher than in the first, which helps guarantee the consistency of large-scale corpus annotation and improve its quality.
Funding: Supported by the National Natural Science Foundation of China (61373108), the Humanities and Social Science Foundation of the Ministry of Education of China (16YJCZH004), and the Major Projects of the National Social Science Foundation of China (11&ZD189).