The rapid advancement of artificial intelligence (AI) has ushered in a new era of medical multimodal large language models (MLLMs), which integrate diverse data modalities such as text, imaging, physiological signals, and genomics to enhance clinical decision-making. This systematic review explores the core methodologies and applied research frontiers of medical MLLMs, focusing on their architecture, training methods, evaluation techniques, and applications. We highlight the transformative potential of MLLMs in achieving cross-modal semantic alignment, medical knowledge integration, and robust clinical reasoning. Despite their promise, challenges such as data heterogeneity, hallucination, and computational efficiency persist. By reviewing state-of-the-art solutions and future directions, this paper provides a comprehensive technical guide for developing reliable and interpretable medical MLLMs, ultimately aiming to bridge the gap between AI and clinical practice. Funding: supported by the National Natural Science Foundation of China (Grant No. 62172458).
BACKGROUND: Gastrointestinal diseases have complex etiologies and clinical presentations. An accurate diagnosis requires physicians to integrate diverse information, including medical history, laboratory test results, and imaging findings. Existing artificial intelligence-assisted diagnostic tools are limited to single-modality information, resulting in recommendations that are often incomplete and may be associated with clinical or legal risks. AIM: To develop and evaluate a collaborative multimodal large language model (LLM) framework for clinical decision-making in digestive diseases. METHODS: In this observational study, DeepGut, a multimodal LLM collaborative diagnostic framework, was developed to integrate four distinct large models into a four-tiered structure. The framework sequentially accomplishes multimodal information extraction, logical "chain" construction, diagnostic and treatment suggestion generation, and risk analysis. The model was evaluated using objective metrics, which assess the reliability and comprehensiveness of model-generated results, and subjective expert opinions, which examine the effectiveness of the framework in assisting physicians. RESULTS: The diagnostic and treatment recommendations generated by the DeepGut framework achieved exceptional performance, with a diagnostic accuracy of 97.8%, diagnostic completeness of 93.9%, treatment plan accuracy of 95.2%, and treatment plan completeness of 98.0%, significantly surpassing the capabilities of single-modality LLM-based diagnostic tools. Experts evaluating the framework commended the completeness, relevance, and logical coherence of its outputs. However, the collaborative multimodal LLM approach resulted in increased input and output token counts, leading to higher computational costs and extended diagnostic times. CONCLUSION: The framework successfully integrates multimodal diagnostic data, demonstrating enhanced performance enabled by multimodal LLM collaboration, which opens new horizons for the clinical application of artificial intelligence-assisted technology. Funding: supported by the China Health Promotion Foundation Young Doctors' Research Foundation for Inflammatory Bowel Disease; the Taishan Scholars Program of Shandong Province, China (No. tsqn202306343); and the National Natural Science Foundation of China (Nos. 82270580, 82070552, 82270578, and 82300599).
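As a rough illustration of the four-tiered collaboration described in this abstract, the sketch below wires four model calls into a sequential pipeline (extraction, chain construction, suggestion generation, risk analysis). It is a hypothetical sketch only, not the authors' implementation; the model names and the call_model helper are assumptions.

```python
# Hypothetical sketch of a four-tiered multimodal LLM pipeline in the spirit of the
# DeepGut description: extraction -> logical chain construction -> suggestion
# generation -> risk analysis. `call_model` is an assumed helper, not a published API.
from dataclasses import dataclass

@dataclass
class PatientCase:
    history: str
    lab_results: str
    image_paths: list[str]

def call_model(model: str, prompt: str, images: list[str] | None = None) -> str:
    """Send a prompt (plus optional images) to some LLM/MLLM backend; stub here."""
    raise NotImplementedError

def diagnose(case: PatientCase) -> dict:
    # Tier 1: multimodal information extraction from history, labs, and imaging.
    findings = call_model(
        "extractor-mllm",
        f"Summarize the key findings.\nHistory: {case.history}\nLabs: {case.lab_results}",
        images=case.image_paths,
    )
    # Tier 2: construct an explicit logical "chain" linking findings to candidate diagnoses.
    chain = call_model("reasoner-llm", f"Link these findings into a diagnostic chain:\n{findings}")
    # Tier 3: generate diagnostic and treatment suggestions from the chain.
    plan = call_model("planner-llm", f"Propose diagnoses and treatment options:\n{chain}")
    # Tier 4: analyze the clinical and legal risks of the proposed plan.
    risks = call_model("auditor-llm", f"List risks and contraindications for this plan:\n{plan}")
    return {"findings": findings, "chain": chain, "plan": plan, "risks": risks}
```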
This study proposed an optimization method for multimodal large language model (MLLM) reasoning based on a structured chain of thought, aiming to enhance visual decision-making capability in tree-falling scenarios. The research first analyzed the challenges faced by existing MLLMs when processing complicated visual scenes, including insufficient reasoning performance and low integration efficiency with other systems. To address these issues, an innovative structured chain-of-thought approach was introduced, which significantly improved the reasoning accuracy of the model in handling complex visual scenarios. To validate the proposed method, a specialized dataset focusing on tree-falling scenarios in social governance was constructed, and a practical agent workflow was designed based on this dataset. Experimental results demonstrated that the proposed approach achieved better performance in real-world applications. The findings provide a reliable and efficient technical solution for visual decision-making in social governance.
Multimodal large language models (MLLMs) have demonstrated strong capabilities in vision-related tasks, capitalizing on their visual semantic comprehension and reasoning capabilities. However, their ability to detect subtle visual spoofing and forgery clues in face attack detection tasks remains underexplored. In this paper, we introduce a benchmark, SHIELD, to evaluate MLLMs for face spoofing and forgery detection. Specifically, we design true/false and multiple-choice questions to assess MLLM performance on multimodal face data across two tasks. For the face anti-spoofing task, we evaluate three modalities (i.e., RGB, infrared, and depth) under six attack types. For the face forgery detection task, we evaluate GAN-based and diffusion-based data, incorporating visual and acoustic modalities. We conduct zero-shot and few-shot evaluations in standard and chain-of-thought (COT) settings. Additionally, we propose a novel multi-attribute chain-of-thought (MA-COT) paradigm for describing and judging various task-specific and task-irrelevant attributes of face images. The findings of this study demonstrate that MLLMs exhibit strong potential for addressing the challenges associated with the security of facial recognition technology applications. Funding: supported by the National Natural Science Foundation of China (No. 62306061), the Guangdong Basic and Applied Basic Research Foundation (No. 2023A1515140037), and the Graduate Innovation Fund Project of Shijiazhuang Tiedao University (No. YC202449).
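To make the question formats concrete, the following sketch assembles a true/false anti-spoofing query and a multi-attribute chain-of-thought (MA-COT) style prompt. The wording and the attribute list are illustrative assumptions, not the benchmark's actual prompts.

```python
# Hypothetical sketch of assembling SHIELD-style queries: a true/false question for
# face anti-spoofing and an MA-COT style prompt. The attribute list and phrasing are
# illustrative assumptions only.
def true_false_question(modality: str) -> str:
    return (
        f"This is a {modality} image of a face. "
        "Is this a live (bona fide) face rather than a presentation attack? Answer yes or no."
    )

def ma_cot_prompt(attributes: list[str]) -> str:
    # One reasoning step per attribute, followed by the final real/spoof decision.
    steps = "\n".join(
        f"{i}. Describe the {attr} of the face and state whether it looks natural."
        for i, attr in enumerate(attributes, start=1)
    )
    return (
        "Analyze the image step by step before answering.\n"
        f"{steps}\n"
        "Finally, decide: real face or spoof/forgery? Explain briefly."
    )

if __name__ == "__main__":
    print(true_false_question("infrared"))
    print(ma_cot_prompt(["skin texture", "eye reflections", "edges around the face region"]))
```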
Background: Multimodal large language models (LLMs) have shown potential in various health-related fields. However, many healthcare studies have raised concerns about the reliability and biases of LLMs in healthcare applications. Methods: To explore the practical application of multimodal LLMs in skin disease identification, and to evaluate sex and age biases, we tested the performance of 2 popular multimodal LLMs, ChatGPT-4 and LLaVA-1.6, across diverse sex and age groups using a subset of a large dermatoscopic dataset containing around 10,000 images and 3 skin diseases (melanoma, melanocytic nevi, and benign keratosis-like lesions). Results: In comparison with 3 convolutional neural network (CNN)-based deep learning models (VGG16, ResNet50, and Model Derm) and one vision transformer model (Swin-B), we found that ChatGPT-4 and LLaVA-1.6 demonstrated overall accuracies that were 3% and 23% higher (and F1-scores that were 4% and 34% higher), respectively, than the best-performing CNN-based baseline, while maintaining accuracies that were 38% and 26% lower (and F1-scores that were 38% and 19% lower), respectively, than Swin-B. Meanwhile, ChatGPT-4 is generally unbiased in identifying these skin diseases across sex and age groups, and LLaVA-1.6 is generally unbiased across age groups, in contrast to Swin-B, which is biased in identifying melanocytic nevi. Conclusions: This study suggests the usefulness and fairness of LLMs in dermatological applications, aiding physicians and practitioners with diagnostic recommendations and patient screening. To further verify and evaluate the reliability and fairness of LLMs in healthcare, experiments using larger and more diverse datasets need to be performed in the future. Funding: supported by the National Institutes of Health through grant RM1HG009034 (to B.A.M.).
Gastrointestinal (GI) cancers represent a major global health concern due to their high incidence and mortality rates. Foundation models (FMs), also referred to as large models, represent a novel class of artificial intelligence technologies that have demonstrated considerable potential in addressing these challenges. These models encompass large language models (LLMs), vision FMs (VFMs), and multimodal LLMs (MLLMs), all of which utilize transformer architectures and self-supervised pre-training on extensive unlabeled datasets to achieve robust cross-domain generalization. This review delineates the principal applications of these models: LLMs facilitate the structuring of clinical narratives, extraction of insights from medical records, and enhancement of physician-patient communication; VFMs are employed in the analysis of endoscopic, radiological, and pathological images for lesion detection and staging; MLLMs integrate heterogeneous data modalities, including imaging, textual information, and genomic data, to support diagnostic processes, treatment prediction, and prognostic evaluation. Despite these promising developments, several challenges remain, such as the need for data standardization, limited diversity within training datasets, substantial computational resource requirements, and ethical-legal concerns. In conclusion, FMs exhibit significant potential to advance the research and clinical management of GI cancers. Future research efforts should prioritize the refinement of these models, promote international collaborations, and adopt interdisciplinary approaches. Such a comprehensive strategy is essential to fully harness the capabilities of FMs, driving substantial progress in the fight against GI malignancies. Funding: supported by the Open Project Program of Panxi Crops Research and Utilization Key Laboratory of Sichuan Province (No. SZKF202302) and the Fundamental Research Funds for the Central Universities (No. 2019CDYGYB024).
Video-based large language models (Video-LLMs) have recently been introduced, targeting both fundamental improvements in perception and comprehension and a diverse range of user inquiries. In pursuit of the ultimate goal of achieving artificial general intelligence, a truly intelligent Video-LLM should not only see and understand its surroundings but also possess human-level commonsense and make well-informed decisions for users. To guide the development of such a model, establishing a robust and comprehensive evaluation system becomes crucial. To this end, this paper proposes Video-Bench, a new comprehensive benchmark along with a toolkit specifically designed for evaluating Video-LLMs. The benchmark comprises 10 meticulously crafted tasks, evaluating the capabilities of Video-LLMs across three distinct levels: video-exclusive understanding, prior knowledge-based question-answering, and comprehension and decision-making. In addition, we introduce an automatic toolkit tailored to process model outputs for various tasks, facilitating the calculation of metrics and conveniently generating final scores. We evaluate 9 representative Video-LLMs using Video-Bench. The findings reveal that current Video-LLMs still fall considerably short of achieving human-like comprehension and analysis of real-world video, and they offer valuable insights for future research directions. The benchmark and toolkit are available at https://github.com/PKU-YuanGroup/Video-Bench. Funding: supported in part by grants from the National Natural Science Foundation of China (62202014, 62332002, 62425101).
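The toolkit's role of turning per-task metrics into final scores could look something like the sketch below, which averages task accuracies within each of the three levels and then across levels. The task-to-level grouping, the placeholder task names, and the equal weighting are assumptions for illustration, not Video-Bench's actual scoring rules.

```python
# Hypothetical sketch of aggregating per-task accuracies into level scores and a final
# score, in the spirit of the Video-Bench toolkit. Task names, grouping, and equal
# weighting are assumptions for illustration only.
from statistics import mean

LEVELS = {
    "video_exclusive_understanding": ["task_a", "task_b"],
    "prior_knowledge_qa": ["task_c", "task_d"],
    "comprehension_and_decision_making": ["task_e", "task_f"],
}

def level_scores(task_accuracy: dict[str, float]) -> dict[str, float]:
    # Average the accuracies of the tasks assigned to each level.
    return {level: mean(task_accuracy[t] for t in tasks) for level, tasks in LEVELS.items()}

def final_score(task_accuracy: dict[str, float]) -> float:
    # Equal weight per level (assumption), giving one headline number per model.
    return mean(level_scores(task_accuracy).values())

if __name__ == "__main__":
    accs = {"task_a": 0.62, "task_b": 0.48, "task_c": 0.55,
            "task_d": 0.41, "task_e": 0.37, "task_f": 0.52}
    print(final_score(accs))
```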
Due to the trend of digital transformation among cultural institutions and the substantial influence of social media platforms, the demand for visual communication keeps increasing for promoting traditional cultural artifacts online. As an effective medium, posters serve to attract public attention and facilitate broader engagement with cultural artifacts. However, existing poster generation methods mainly rely on fixed templates and manual design, which limits their scalability and adaptability to the diverse visual and semantic features of the artifacts. Therefore, we propose CAPGen, an automated aesthetic Cultural Artifacts Poster Generation framework built on a Multimodal Large Language Model (MLLM) with integrated iterative optimization. During our research, we collaborated with designers to define principles of graphic design for cultural artifact posters, which guide the MLLM in generating layout parameters. These parameters were then rendered into posters. Finally, we refined the posters using an MLLM integrated with a multi-round iterative optimization mechanism. Qualitative results show that CAPGen consistently outperforms baseline methods in both visual quality and aesthetic performance. Furthermore, ablation studies indicate that the prompt, the iterative optimization mechanism, and the design principles significantly enhance the effectiveness of poster generation. Funding: supported by the National Key Research and Development Program of China (2023YFF0906502) and the Postgraduate Research and Innovation Project of Hunan Province (Grant CX20240473).
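A generate-critique-refine loop of the kind described here might be structured as in the following sketch. The helper functions (mllm_propose_layout, render_poster, mllm_critique) are hypothetical stand-ins, not functions from the CAPGen codebase.

```python
# Hypothetical generate-critique-refine loop loosely following the CAPGen workflow
# described above. All three helpers are assumed stand-ins, not the paper's code.
def mllm_propose_layout(image_path: str, description: str, feedback: str = "") -> dict:
    """Ask an MLLM (prompted with the design principles) for layout parameters; stub."""
    raise NotImplementedError

def render_poster(image_path: str, layout: dict) -> bytes:
    """Render the layout parameters (text placement, fonts, colors) onto the artifact image; stub."""
    raise NotImplementedError

def mllm_critique(poster: bytes) -> dict:
    """Ask an MLLM to judge the poster; returns {'acceptable': bool, 'suggestions': str}; stub."""
    raise NotImplementedError

def generate_poster(image_path: str, description: str, max_rounds: int = 3) -> bytes:
    layout = mllm_propose_layout(image_path, description)
    poster = render_poster(image_path, layout)
    for _ in range(max_rounds):                      # multi-round iterative optimization
        review = mllm_critique(poster)
        if review["acceptable"]:
            break
        layout = mllm_propose_layout(image_path, description, feedback=review["suggestions"])
        poster = render_poster(image_path, layout)
    return poster
```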
The rapid advancement of artificial intelligence has significantly impacted education, with large-scale foundation models (LFMs) emerging as transformative tools. While LFMs have demonstrated exceptional performance across diverse domains, their integration into K-12 education remains in its early stages, requiring alignment with pedagogical principles, cognitive development, and curriculum standards. This paper provides a comprehensive technological review of LFM applications in K-12 education, examining current workflows, challenges, and future opportunities. We explore how LFMs facilitate personalized learning, teacher-student collaboration, and automated assessment, while highlighting critical issues such as motivation, engagement, and age-appropriate instructional strategies. By analyzing global developments, this study offers valuable insights for educators seeking to optimize AI-driven teaching methods and for students leveraging AI for self-directed learning. Our findings aim to inform future research and drive innovation in educational AI, ensuring the effective and ethical integration of LFMs into the evolving K-12 educational landscape.
Traditional quantitative investment research is encountering diminishing returns alongside rising labor and time costs. To overcome these challenges, we introduce the large investment model (LIM), a novel research paradigm designed to enhance both performance and efficiency at scale. LIM employs end-to-end learning and universal modeling to create an upstream foundation model capable of autonomously learning comprehensive signal patterns from diverse financial data spanning multiple exchanges, instruments, and frequencies. These "global patterns" are subsequently transferred to downstream strategy modeling, optimizing performance for specific tasks. We detail the system architecture design of LIM, address the technical challenges inherent in this approach, and outline potential directions for future research.
This paper introduces federated services as a smart service ecology with federated security to align distributed data supply with diversified service demands spanning digital and societal contexts. It presents comprehensive research on the theoretical foundation and technical system of federated services, aiming to advance our understanding and implementation of this novel service paradigm. First, a thorough examination of the characteristics of federated security within federated services is conducted. Then, a five-layer technical framework is formulated under a decentralized intelligent architecture, ensuring secure, agile, and adaptable service provision. On this basis, the operational mechanisms underlying data federation and service confederation are analyzed, with emphasis on the smart supply-demand matching model. Furthermore, a scenario-oriented taxonomy of federated services, accompanied by illustrative examples, is proposed. Our work offers actionable insights and a roadmap for realizing and advancing federated services, contributing to the refinement and wider adoption of this transformative service paradigm in the digital era. Funding: supported by the National Key Research and Development Program of China (2021YFB2104800), the National Natural Science Foundation of China (62103411, 62436010, 72171230), and the Science and Technology Development Fund of Macao SAR (0093/2023/RIA2, 0050/2020/A1).
Video-text retrieval (VTR) is an essential task in multimodal learning, aiming to bridge the semantic gap between visual and textual data. Effective video frame sampling plays a crucial role in improving retrieval performance, as it determines the quality of the visual content representation. Traditional sampling methods, such as uniform sampling and optical flow-based techniques, often fail to capture the full semantic range of videos, leading to redundancy and inefficiencies. In this work, we propose CLIP4Video-Sampling: Global Semantics-Guided Multi-Granularity Frame Sampling for Video-Text Retrieval, a global semantics-guided multi-granularity frame sampling strategy designed to optimize both computational efficiency and retrieval accuracy. By integrating multi-scale global and local temporal sampling and leveraging the CLIP (Contrastive Language-Image Pre-training) model's powerful feature extraction capabilities, our method significantly outperforms existing approaches in both zero-shot and fine-tuned video-text retrieval tasks on popular datasets. CLIP4Video-Sampling reduces redundancy, ensures keyframe coverage, and serves as an adaptable pre-processing module for multimodal models.
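One way to picture global-semantics-guided sampling is sketched below: frames are embedded (for example with CLIP's image encoder), ranked by similarity to the video-level mean embedding, and selected greedily with a redundancy check. This is an illustrative approximation under those assumptions, not the paper's exact algorithm; embed_frames is an assumed helper.

```python
# Illustrative approximation of global-semantics-guided frame sampling: embed frames,
# rank them by similarity to the video-level mean embedding, and pick a diverse subset.
import numpy as np

def embed_frames(frames) -> np.ndarray:
    """Return an (N, D) array of L2-normalized frame embeddings (e.g., from a CLIP image encoder); stub."""
    raise NotImplementedError

def sample_frames(frames, k: int = 8, max_similarity: float = 0.9) -> list[int]:
    emb = embed_frames(frames)                     # (N, D), rows normalized
    global_vec = emb.mean(axis=0)
    global_vec /= np.linalg.norm(global_vec)
    relevance = emb @ global_vec                   # cosine similarity to the "global semantics"
    selected: list[int] = []
    for idx in np.argsort(-relevance):             # most globally relevant first
        if len(selected) == k:
            break
        # Redundancy control: skip frames nearly identical to already-selected ones.
        if all(float(emb[idx] @ emb[j]) < max_similarity for j in selected):
            selected.append(int(idx))
    return sorted(selected)                        # keep temporal order for downstream models
```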
Colonoscopy is currently one of the most sensitive screening methods for colorectal cancer. This study investigates the frontiers of intelligent colonoscopy techniques and their prospective implications for multimodal medical applications. With this goal, we begin by assessing the current data-centric and model-centric landscapes through four tasks for colonoscopic scene perception, including classification, detection, segmentation, and vision-language understanding. Our assessment reveals domain-specific challenges and underscores the need for further multimodal research in colonoscopy. To address these gaps, we establish three foundational initiatives: a large-scale multimodal instruction tuning dataset, ColonINST; a colonoscopy-designed multimodal language model, ColonGPT; and a multimodal benchmark. To facilitate continuous advancements in this rapidly evolving field, we provide a public website for the latest updates: https://github.com/ai4colonoscopy/IntelliScope. Funding: supported by NSFC, China (No. 62476143); the Fundamental Research Funds for the Central Universities, China (Nankai University, No. 63253218); the ANU-Optus Bushfire Research Centre of Excellence (BRCoE) (scholarship awarded to Ge-Peng Ji); and the Natural Science Foundation of China (NSFC) (No. 62306162).
This study investigates the application and performance of the Segment Anything Model 2 (SAM2) in the challenging task of video camouflaged object segmentation (VCOS). VCOS involves detecting objects that blend seamlessly into their surroundings in videos due to similar colors and textures and poor lighting conditions. Compared with objects in normal scenes, camouflaged objects are much more difficult to detect. SAM2, a video foundation model, has shown potential in various tasks. However, its effectiveness in dynamic camouflaged scenarios remains under-explored. This paper presents a comprehensive study of SAM2's ability in VCOS. First, we assess SAM2's performance on camouflaged video datasets using different models and prompts (click, box, and mask). Second, we explore the integration of SAM2 with existing multimodal large language models (MLLMs) and VCOS methods. Third, we specifically adapt SAM2 by fine-tuning it on the video camouflage dataset. Our comprehensive experiments demonstrate that SAM2 has an excellent zero-shot ability to detect camouflaged objects in videos. We also show that this ability can be further improved by specifically adjusting SAM2's parameters for VCOS.
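A prompt-based zero-shot evaluation loop of the kind used here might be organized as in the sketch below, which propagates a first-frame prompt through each video and scores predictions with mean IoU. The segment_video wrapper is a hypothetical stand-in rather than SAM2's actual API.

```python
# Hypothetical sketch of prompt-based zero-shot evaluation on a camouflaged-video
# dataset: give one prompt (click, box, or mask) on the first frame, propagate it
# through the video, and score predictions with mean IoU.
import numpy as np

def segment_video(frames: list, first_frame_prompt: dict) -> list:
    """Return one binary mask per frame given a first-frame prompt; stub for a SAM2-style predictor."""
    raise NotImplementedError

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union else 1.0

def evaluate(dataset, prompt_type: str = "box") -> float:
    scores = []
    for frames, gt_masks, prompts in dataset:       # prompts: {"click": ..., "box": ..., "mask": ...}
        preds = segment_video(frames, prompts[prompt_type])
        scores.extend(iou(np.asarray(p) > 0, np.asarray(g) > 0) for p, g in zip(preds, gt_masks))
    return float(np.mean(scores))
```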
Video large language models (video-LLMs) have demonstrated impressive capabilities in multimodal understanding, but their potential as zero-shot evaluators for temporal consistency in video captions remains underexplored. Existing methods notably underperform in detecting critical temporal errors, such as missing, hallucinated, or misordered actions. To address this gap, we introduce two key contributions. (1) TimeJudge: a novel zero-shot framework that recasts temporal error detection as answering calibrated binary question pairs. It incorporates modality-sensitive confidence calibration and uses consistency-weighted voting for robust prediction aggregation. (2) TEDBench: a rigorously constructed benchmark featuring videos across four distinct complexity levels, specifically designed with fine-grained temporal error annotations to evaluate video-LLM performance on this task. Through a comprehensive evaluation of multiple state-of-the-art video-LLMs on TEDBench, we demonstrate that TimeJudge consistently yields substantial gains in recall and F1-score without requiring any task-specific fine-tuning. Our approach provides a generalizable, scalable, and training-free solution for enhancing the temporal error detection capabilities of video-LLMs. Funding: supported by the National Natural Science Foundation of China (Nos. 62272184 and 62402189), the China Postdoctoral Science Foundation (Nos. 2024M751012, 2025T180429, and GZC20230894), and the Postdoctor Project of Hubei Province (No. 2024HBBHCXB014).
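The calibrated binary question pairs and consistency-weighted voting could be combined roughly as in the following sketch; the question wording, weighting scheme, and decision threshold are assumptions for illustration, not TimeJudge's published procedure.

```python
# Rough sketch of consistency-weighted voting over calibrated binary question pairs:
# each captioned action is queried in a positive and a negated form; consistent pairs
# vote with their calibrated confidence, inconsistent pairs are down-weighted.
def ask(video, question: str) -> tuple[bool, float]:
    """Return (yes/no answer, calibrated confidence) from a video-LLM; stub."""
    raise NotImplementedError

def event_support(video, event: str) -> float:
    yes_ans, yes_conf = ask(video, f"Does the action '{event}' occur in the video?")
    no_ans, no_conf = ask(video, f"Is it true that the action '{event}' never occurs?")
    consistent = yes_ans != no_ans                       # a consistent pair answers oppositely
    weight = (yes_conf + no_conf) / 2 if consistent else 0.25
    vote = 1.0 if yes_ans else 0.0
    return weight * vote + (1.0 - weight) * 0.5          # pull low-confidence votes toward 0.5

def caption_has_temporal_error(video, captioned_events: list[str], threshold: float = 0.5) -> bool:
    support = sum(event_support(video, e) for e in captioned_events) / len(captioned_events)
    return support < threshold                           # weak support => flag a temporal error
```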