Objective: To comparatively analyze the readability and information quality of Chinese- and English-language educational materials for patients undergoing thoracoscopic lobectomy generated by three mainstream large language models (LLMs), namely DeepSeek, Grok-3, and ChatGPT, and to provide an evidence-based basis for the clinical selection of AI-assisted education tools.
Methods: A cross-sectional design was adopted, with "education for patients undergoing thoracoscopic lobectomy" as the core requirement. Standardized Chinese and English prompts drove each of the three models to generate three independent educational materials (18 in total: 9 in Chinese and 9 in English). Readability was evaluated with internationally recognized assessment tools (English: Flesch-Kincaid Grade Level, FKGL; Flesch Reading Ease, FRE; Chinese: average sentence length), and information quality was evaluated with the DISCERN instrument. Differences among the three models were compared with the Kruskal-Wallis H test, differences between the Chinese and English versions were analyzed with the paired-sample t-test, and inter-rater reliability was assessed with the intraclass correlation coefficient (ICC).
Results: 1. Readability: Among the English versions, DeepSeek V3 had the highest FRE score (80.36±1.18) and the lowest FKGL score (4.83±0.12), significantly better than ChatGPT-o3 (FRE: 67.36±0.74; FKGL: 6.56±0.36) and Grok-3 (FRE: 45.67±1.65; FKGL: 11.93±0.17) (P<0.05). Among the Chinese versions, Grok-3 had the shortest average sentence length (17.74±1.02 characters), significantly better than ChatGPT-o3 (27.81±1.47 characters) and DeepSeek V3 (26.75±1.18 characters) (P<0.05). 2. Information quality: Inter-rater reliability was excellent (ICC=0.92, 95%CI: 0.925-0.998, P<0.001). The DISCERN total scores of the Chinese and English versions of all three models were at the "good to excellent" level (59.00-71.17 points). ChatGPT-o3 scored highest in both languages (English: 71.17±1.17; Chinese: 70.50±0.55) and Grok-3 lowest (English: 63.17±0.94; Chinese: 59.00±0.89); the between-group differences were statistically significant (P<0.05).
Conclusion: Among the educational materials for thoracoscopic lobectomy generated by the three LLMs, the English output of DeepSeek V3 is the most readable, the Chinese output of Grok-3 shows outstanding reading fluency, and ChatGPT-o3 performs in a balanced way across both languages. The Chinese versions still need optimization in terminology consistency and informational detail. In clinical application, the model should be selected according to language requirements, and AI-generated content should undergo professional review.
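For reference, the readability indices used in the Methods can be computed directly from text counts. Below is a minimal sketch of the standard FRE and FKGL formulas and of a Chinese average-sentence-length measure; the function names and the choice of Chinese sentence delimiters are our own illustrative assumptions, not details taken from the study.

```python
import re

def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher scores mean easier text
    (e.g. 80+ is roughly easy conversational English)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level: approximate U.S. school grade
    needed to understand the text."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def avg_sentence_length_zh(text):
    """Average Chinese sentence length in characters, splitting on
    full-width sentence-ending punctuation (assumed delimiters)."""
    sentences = [s for s in re.split(r"[。！？]", text) if s.strip()]
    return sum(len(s) for s in sentences) / len(sentences)
```

In practice the word, sentence, and syllable counts would come from a text-analysis tool; libraries such as `textstat` wrap these same formulas.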
Funding: supported by the National High Level Hospital Clinical Research Funding (80102022501).
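The statistical comparisons described in the Methods can be sketched with SciPy as follows. The score values below are illustrative placeholders chosen to resemble the reported group ordering, not the study's raw data.

```python
from scipy import stats

# Hypothetical FRE scores for three English materials per model
# (placeholder values, not the study's measurements).
deepseek = [80.1, 80.4, 80.6]
chatgpt  = [67.0, 67.4, 67.7]
grok     = [44.5, 45.8, 46.7]

# Kruskal-Wallis H test: non-parametric comparison of the three
# independent model groups.
h, p_kw = stats.kruskal(deepseek, chatgpt, grok)

# Paired-sample t-test: e.g. DISCERN totals of matched English vs
# Chinese materials (again, placeholder values).
en = [71, 72, 70, 71, 72, 71]
zh = [70, 71, 70, 70, 71, 71]
t, p_pair = stats.ttest_rel(en, zh)
```

The Kruskal-Wallis test is chosen here, as in the study, because the small group sizes (three materials per model) make normality assumptions hard to justify; the paired t-test matches each English material with its Chinese counterpart.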