Abstract
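In the complex application scenarios of the health industry, the limited domain adaptability of generative AI large language models (large models) constrains their broad application and capacity for innovation in specialized fields such as healthcare and finance, and the open corpora on which self-supervised learning currently relies cannot readily meet the high-precision, high-specificity requirements of the medical field. Training a large model on multimodal corpora from the health industry yields a vertical large model for that industry. Such a vertical model has the domain knowledge and the ability to address medical and health problems in a targeted way, meets the industry's specialized application needs, and reaches a level of specialization, efficiency, and precision that general-purpose large models cannot match. This paper systematically reviews corpus construction models, technical implementations, and their shortcomings in the health industry at home and abroad, and proposes a standardized framework centered on a “disease-scenario association matrix”, which uses a task-matching mechanism to achieve multidimensional mapping between disease classifications and healthcare scenarios. It also proposes a standards system for specialized multimodal corpora that covers the key stages of data collection, quality assessment, data annotation, and privacy protection, forming a task-driven, tier-adapted, multi-level multimodal corpus resource structure. A closed loop of quality control and feedback enables dynamic optimization and continuous iteration of the industry corpus, providing systematic methods and theoretical support for building a health-industry multimodal corpus of high-quality data that covers the business needs of the healthcare field.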
The rapid advancement and widespread adoption of generative artificial intelligence (AI) technologies have demonstrated significant potential across a variety of sectors. However, in the medical domain, which is characterized by complexity, high precision, and specialized knowledge requirements, the application of general-purpose large language models (LLMs) is often constrained by limitations in domain adaptability. While general LLMs leverage self-supervised learning based on large-scale open-domain corpora, such data sources typically lack the granularity, specificity, and semantic precision necessary for healthcare and biomedical applications. Consequently, the efficacy and reliability of these models in clinical and healthcare-related scenarios remain limited. In contrast, vertical domain-specific large models (often referred to as vertical foundation models) offer promising solutions to overcome these challenges. These models are designed with a focus on domain specialization, incorporating expert-curated corpora, fine-grained medical ontologies, and targeted task formulations. This specialized approach enables vertical models to achieve higher accuracy, better contextual understanding, and more effective task performance in scenarios where general models fail to deliver sufficient precision. This paper conducts a comprehensive analysis of existing health and medical corpus construction practices, both domestically and internationally, identifying critical gaps in data structure, standardization, and adaptability to AI tasks. Building upon this analysis, we propose a standardized construction framework centered on a “Disease-Scenario Association Matrix”. This framework facilitates the multidimensional mapping between disease classifications (based on standards such as ICD-10 and adjusted for domestic needs) and medical application scenarios, which are guided by authoritative references such as the “Guidelines for AI Application Scenarios in the Health Industry” issued by national health authorities. The matrix serves as a dynamic, task-driven mechanism that connects disease characteristics to specific healthcare scenarios, enabling targeted corpus development for high-priority use cases. To ensure data utility, quality, and long-term sustainability, we further introduce a corpus standards system encompassing four critical dimensions: data acquisition, quality evaluation, annotation protocols, and privacy protection. Data collection protocols are aligned with international standards such as HL7 and FHIR to ensure structural and semantic interoperability. Quality assessment frameworks are developed based on criteria such as completeness, accuracy, timeliness, and consistency, integrating both automated and manual validation mechanisms. Annotation systems are designed with hierarchical and rule-based structures, leveraging domain-specific pre-trained models such as BioBERT and U-Net for semi-automated labeling, followed by expert review for validation. Meanwhile, privacy protection is achieved through a combination of data de-identification, encryption, federated learning, and access control strategies to ensure full compliance with data governance and ethical standards. Finally, a dynamic feedback and quality control loop is incorporated into the corpus development process to enable continuous updates, refinement, and expansion. By integrating these mechanisms into a multi-level, task-adaptive architecture, this framework lays the methodological and theoretical foundation for constructing a high-quality, multimodal AI corpus tailored to the unique demands of the healthcare sector. This corpus will support the development and deployment of vertical domain large models, unlocking new capabilities for intelligent diagnostics, clinical decision support, and precision medicine.
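The task-matching idea behind the disease-scenario association matrix can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, since the abstract does not specify them: the three ICD-10 chapter ranges chosen as rows, the scenario names used as columns, the relevance weights, the `CorpusTask` type, the `match_tasks` helper, and the 0.75 threshold.

```python
from dataclasses import dataclass

# Hypothetical rows: a few ICD-10 chapter ranges (the paper's actual
# disease classification is not given in the abstract).
DISEASES = ["I00-I99 circulatory", "C00-D48 neoplasms", "J00-J99 respiratory"]

# Hypothetical columns: illustrative application scenarios.
SCENARIOS = ["imaging diagnosis", "clinical decision support", "follow-up management"]

# Association matrix: MATRIX[i][j] is an assumed relevance weight in [0, 1]
# between DISEASES[i] and SCENARIOS[j]. The values are made up.
MATRIX = [
    [0.6, 0.9, 0.7],
    [0.9, 0.8, 0.5],
    [0.8, 0.6, 0.3],
]


@dataclass(frozen=True)
class CorpusTask:
    """One matrix cell selected as a target for corpus construction."""
    disease: str
    scenario: str
    weight: float


def match_tasks(matrix, diseases, scenarios, threshold=0.75):
    """Return the cells whose weight meets the threshold, highest weight first."""
    tasks = [
        CorpusTask(disease, scenario, matrix[i][j])
        for i, disease in enumerate(diseases)
        for j, scenario in enumerate(scenarios)
        if matrix[i][j] >= threshold
    ]
    return sorted(tasks, key=lambda task: task.weight, reverse=True)


if __name__ == "__main__":
    for task in match_tasks(MATRIX, DISEASES, SCENARIOS):
        print(f"{task.weight:.1f}  {task.disease} -> {task.scenario}")
```

Under these assumptions, raising or lowering the threshold is one simple way to express "high-priority use cases"; a real framework would presumably derive the weights from clinical and policy inputs rather than hard-coding them.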
Authors
沈剑峰
黄茹
闵栋
车慧
李宝山
刘丽红
张智
程京
王杉
Jianfeng Shen; Ru Huang; Dong Min; Hui Che; Baoshan Li; Lihong Liu; Zhi Zhang; Jing Cheng; Shan Wang (AI Project Team, Smart Hospital Branch, Chinese Society of Medical Equipment, Beijing 100082, China; Institute of Cloud Computing and Big Data, China Academy of Information and Communications Technology (CAICT), Beijing 100083, China; Information Center, Peking University People’s Hospital, Beijing 100044, China; Institute of Big Data and Artificial Intelligence, National Engineering Research Center for Beijing Biochips Technology, Beijing 102206, China; School of Biomedical Engineering, Tsinghua University, Beijing 100084, China; Surgical Oncology Laboratory, Clinical Big Data Research Center, Peking University People’s Hospital, Beijing 100044, China)
Source
Chinese Science Bulletin (《科学通报》)
Indexed in the Peking University Core Journals list (北大核心)
2025, Issue 26, pp. 4560-4568 (9 pages)
Keywords
generative artificial intelligence (生成式人工智能)
large language model (大语言模型)
vertical large model (垂直大模型)
corpus (语料)
multimodal corpora (多模态语料库)
health industry (卫生健康行业)