Abstract: This study presents a novel multimodal medical image zero-shot segmentation algorithm, the text-visual-prompt segment anything model (TV-SAM), which requires no manual annotations. TV-SAM integrates the large language model GPT-4, the vision-language model GLIP, and the segment anything model (SAM) to autonomously generate descriptive text prompts and visual bounding-box prompts from medical images, thereby enhancing SAM's capability for zero-shot segmentation. Comprehensive evaluations on seven public datasets encompassing eight imaging modalities demonstrate that TV-SAM can effectively segment unseen targets across modalities without additional training. TV-SAM significantly outperforms SAM AUTO (p < 0.01) and GSAM (p < 0.05), closely matches the performance of SAM BBOX with gold-standard bounding-box prompts (p = 0.07), and surpasses state-of-the-art methods on specific datasets such as ISIC (0.853 versus 0.802) and WBC (0.968 versus 0.883). The study indicates that TV-SAM is an effective multimodal medical image zero-shot segmentation algorithm and highlights the significant contribution of GPT-4 to zero-shot segmentation. Integrating foundation models such as GPT-4, GLIP, and SAM enhances the ability to address complex problems in specialized domains.
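The abstract describes a three-stage prompt pipeline: GPT-4 produces a descriptive text prompt, GLIP grounds that text to a bounding box, and SAM segments from the box. The Python sketch below illustrates that flow only; it is not the authors' implementation. The helpers describe_with_gpt4 and ground_with_glip are hypothetical stand-ins for the GPT-4 and GLIP stages, while the final stage uses the real segment_anything package API and the publicly released ViT-H checkpoint name.

    # Minimal sketch of the TV-SAM three-stage pipeline described in the abstract.
    # describe_with_gpt4 and ground_with_glip are hypothetical stand-ins for the
    # paper's GPT-4 and GLIP stages; only the SAM stage uses the real
    # segment_anything API.
    import numpy as np
    from segment_anything import sam_model_registry, SamPredictor

    def describe_with_gpt4(image_meta: str) -> str:
        """Hypothetical: query GPT-4 for a descriptive text prompt for the
        target (e.g. 'irregular pigmented skin lesion')."""
        raise NotImplementedError("wrap a GPT-4 API call here")

    def ground_with_glip(image: np.ndarray, text_prompt: str) -> np.ndarray:
        """Hypothetical: run GLIP phrase grounding to convert the text prompt
        into a bounding box [x0, y0, x1, y1] in pixel coordinates."""
        raise NotImplementedError("wrap GLIP inference here")

    def tv_sam_segment(image: np.ndarray, image_meta: str,
                       checkpoint: str = "sam_vit_h_4b8939.pth") -> np.ndarray:
        # Stage 1: the LLM generates a text prompt, so no manual annotation is needed.
        text_prompt = describe_with_gpt4(image_meta)
        # Stage 2: vision-language grounding turns the text into a visual box prompt.
        box = ground_with_glip(image, text_prompt)
        # Stage 3: SAM segments the region selected by the box prompt.
        sam = sam_model_registry["vit_h"](checkpoint=checkpoint)
        predictor = SamPredictor(sam)
        predictor.set_image(image)  # expects an HxWx3 uint8 RGB array
        masks, scores, _ = predictor.predict(box=box, multimask_output=False)
        return masks[0]  # binary mask with the same spatial size as the image

In such an arrangement, mask quality hinges on how faithfully the grounding stage converts the generated description into a box, which is consistent with the abstract's observation that TV-SAM approaches SAM BBOX given gold-standard boxes.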
Funding: Supported by the National Science and Technology Major Project (No. 2021YFF1201200), the Chinese National Science Foundation (No. 62372316), the Sichuan Science and Technology Program (Nos. 2022YFS0048, 2023YFG0126, and 2024YFHZ0091), the 1·3·5 Project for Disciplines of Excellence, West China Hospital, Sichuan University (No. ZYYC21004), and the Chongqing Technology Innovation and Application Development Project (No. CSTB2022TIAD-KPX0067).