Abstract: Objective: This study assesses the quality of artificial intelligence chatbots in responding to standardized obstetrics and gynecology questions. Methods: ChatGPT-3.5, ChatGPT-4.0, Bard, and Claude were each used to respond to 20 standardized multiple-choice questions on October 7, 2023, and their responses and correctness were recorded. A logistic regression model assessed the relationship between question character count and accuracy. An independent error analysis was undertaken for each incorrectly answered question. Results: ChatGPT-4.0 scored 100% across both obstetrics and gynecology questions. ChatGPT-3.5 scored 95% overall, earning 85.7% in obstetrics and 100% in gynecology. Claude scored 90% overall, earning 100% in obstetrics and 84.6% in gynecology. Bard scored 77.8% overall, earning 83.3% in obstetrics and 75% in gynecology, and declined to respond to two questions. No statistically significant association was found between question character count and accuracy. Conclusions: ChatGPT-3.5 and ChatGPT-4.0 excelled in both obstetrics and gynecology, while Claude performed well in obstetrics but showed minor weaknesses in gynecology. Bard performed the worst and had the most limitations, leading us to favor the other artificial intelligence chatbots as study tools. Our findings support the use of chatbots as a supplement to, not a substitute for, clinician-based learning and historically successful educational tools.
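The character-count analysis described in the Methods could be carried out with a standard logistic regression of per-question correctness on question length. The sketch below is illustrative only: the `char_count` and `correct` values are placeholder data, not the study's actual question set or scores.

```python
# Minimal sketch of a logistic regression of correctness on question character count.
# The data frame below is hypothetical placeholder data for illustration.
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({
    "char_count": [312, 540, 198, 467, 389, 251, 603, 422],  # question lengths (placeholder)
    "correct":    [1,   1,   1,   0,   1,   1,   0,   1],    # 1 = correct response (placeholder)
})

X = sm.add_constant(df["char_count"])          # intercept plus character-count predictor
model = sm.Logit(df["correct"], X).fit(disp=False)
print(model.summary())                          # the p-value on char_count tests the association
```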
Abstract: Background: A recent assessment of ChatGPT on a variety of obstetric and gynecologic topics was very encouraging. However, its ability to respond to commonly asked pregnancy questions is unknown, and the references it provides also need to be verified. Purpose: To evaluate ChatGPT as a source of information for commonly asked pregnancy questions and to verify the references it provides. Methods: A qualitative analysis of ChatGPT was performed. We queried ChatGPT Version 3.5 on 12 commonly asked pregnancy questions and asked for its references. Query responses were graded as "acceptable" or "not acceptable" based on correctness and completeness in comparison with American College of Obstetricians and Gynecologists (ACOG) publications, PubMed-indexed evidence, and clinical experience. References were classified as "verified", "broken", "irrelevant", "non-existent", or "no references". Responses and references were reviewed and graded by the co-authors individually and then as a group to reach consensus. Results: A grade of acceptable was given to 50% of responses (6 of 12 questions); the remaining 50% were graded not acceptable (5 were incomplete and 1 was incorrect). Regarding references, 58% (7 of 12) had deficiencies (5 had no references, 1 had a broken reference, and 1 cited a non-existent reference). Conclusion: Our evaluation of ChatGPT confirms prior concerns regarding both content and references. While AI has enormous potential, it must be carefully evaluated before being accepted as accurate and reliable for this purpose.
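The reported percentages follow directly from tallying the per-question grades. The sketch below reproduces that arithmetic; the per-question labels are placeholders chosen only to match the reported counts, not the study's actual question-by-question grading.

```python
# Illustrative tally of response grades and reference classifications.
# Counts mirror the reported totals (6/12 acceptable; 5 missing, 1 broken,
# 1 non-existent reference); the individual labels are placeholders.
from collections import Counter

response_grades = ["acceptable"] * 6 + ["incomplete"] * 5 + ["incorrect"]
reference_grades = ["verified"] * 5 + ["no references"] * 5 + ["broken"] + ["non-existent"]

acceptable = sum(g == "acceptable" for g in response_grades)
deficient_refs = sum(g != "verified" for g in reference_grades)

print(f"Acceptable responses: {acceptable}/12 ({acceptable / 12:.0%})")        # 6/12 (50%)
print(f"Reference deficiencies: {deficient_refs}/12 ({deficient_refs / 12:.0%})")  # 7/12 (58%)
print(Counter(reference_grades))                                               # breakdown by class
```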