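The Cornerstone for the Emergence of Vertical Large Models in the Health Sector: Building a Specialized Multimodal Industry Corpus (cited by: 1)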

The foundational cornerstone for healthcare AI models: constructing multimodal corpora in the health sector


Abstract (translated from the Chinese): In the complex application scenarios of the health sector, the limited domain adaptability of generative artificial intelligence large language models (large models) restricts their broad application and capacity for innovation in specialized fields such as healthcare and finance. The open corpora on which self-supervised learning currently relies also fall short of the high-precision, high-specificity requirements of the medical field. Training a large model on multimodal health-sector corpora produces a vertical large model for the health sector. Such a model has domain knowledge and the ability to solve healthcare problems in a targeted way, meets the sector's professional application needs, and reaches a level of specialization, efficiency, and precision that general-purpose large models cannot match. This paper systematically reviews corpus construction models, technical implementations, and their shortcomings in the health sector in China and abroad. It proposes a standardized framework centered on a "Disease-Scenario Association Matrix", which uses a task-matching mechanism to map disease classifications to healthcare scenarios along multiple dimensions. It also proposes a standards system for specialized multimodal corpora that covers data acquisition, quality evaluation, data annotation, and privacy protection, forming a task-driven, tier-adapted, multi-level structure of multimodal corpus resources. A closed loop of quality control and feedback enables dynamic optimization and continuous iteration of the industry corpus, providing systematic methods and theoretical support for building a health-sector multimodal corpus with high-quality data that covers the business needs of the medical field.

Abstract (English): The rapid advancement and widespread adoption of generative artificial intelligence (AI) technologies have demonstrated significant potential across a variety of sectors. However, in the medical domain, which is characterized by complexity, high precision, and specialized knowledge requirements, the application of general-purpose large language models (LLMs) is often constrained by limitations in domain adaptability. While general LLMs leverage self-supervised learning based on large-scale open-domain corpora, such data sources typically lack the granularity, specificity, and semantic precision necessary for healthcare and biomedical applications. Consequently, the efficacy and reliability of these models in clinical and healthcare-related scenarios remain limited. In contrast, vertical domain-specific large models (often referred to as vertical foundation models) offer promising solutions to overcome these challenges. These models are designed with a focus on domain specialization, incorporating expert-curated corpora, fine-grained medical ontologies, and targeted task formulations. This specialized approach enables vertical models to achieve higher accuracy, better contextual understanding, and more effective task performance in scenarios where general models fail to deliver sufficient precision.

This paper conducts a comprehensive analysis of existing health and medical corpus construction practices, both domestically and internationally, identifying critical gaps in data structure, standardization, and adaptability to AI tasks. Building upon this analysis, we propose a standardized construction framework centered on a "Disease-Scenario Association Matrix". This framework facilitates the multidimensional mapping between disease classifications (based on standards such as ICD-10 and adjusted for domestic needs) and medical application scenarios, which are guided by authoritative references such as the "Guidelines for AI Application Scenarios in the Health Industry" issued by national health authorities. The matrix serves as a dynamic, task-driven mechanism that connects disease characteristics to specific healthcare scenarios, enabling targeted corpus development for high-priority use cases.

To ensure data utility, quality, and long-term sustainability, we further introduce a corpus standards system encompassing four critical dimensions: data acquisition, quality evaluation, annotation protocols, and privacy protection. Data collection protocols are aligned with international standards such as HL7 and FHIR to ensure structural and semantic interoperability. Quality assessment frameworks are developed based on criteria such as completeness, accuracy, timeliness, and consistency, integrating both automated and manual validation mechanisms. Annotation systems are designed with hierarchical and rule-based structures, leveraging domain-specific pre-trained models such as BioBERT and U-Net for semi-automated labeling, followed by expert review for validation. Meanwhile, privacy protection is achieved through a combination of data de-identification, encryption, federated learning, and access control strategies to ensure full compliance with data governance and ethical standards.

Finally, a dynamic feedback and quality control loop is incorporated into the corpus development process to enable continuous updates, refinement, and expansion. By integrating these mechanisms into a multi-level, task-adaptive architecture, this framework lays the methodological and theoretical foundation for constructing a high-quality, multimodal AI corpus tailored to the unique demands of the healthcare sector. This corpus will support the development and deployment of vertical domain large models, unlocking new capabilities for intelligent diagnostics, clinical decision support, and precision medicine.
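The abstract does not specify how the Disease-Scenario Association Matrix is represented. As an illustration only, the following minimal Python sketch treats it as a sparse mapping from (disease code, scenario) pairs to a priority score and selects the high-priority cells for corpus development. The class name `AssociationMatrix`, the method names, the scenario labels, the priority values, and the threshold are all hypothetical and are not taken from the paper or from the cited Guidelines. The disease keys are ICD-10 category codes.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class AssociationMatrix:
    """Sparse disease x scenario matrix (hypothetical structure).

    Keys are (ICD-10 code, scenario label); values are an assumed
    priority score in [0, 1] for building corpus data for that cell.
    """

    cells: dict[tuple[str, str], float] = field(default_factory=dict)

    def link(self, disease: str, scenario: str, priority: float) -> None:
        """Record that a disease is relevant to a scenario with a priority."""
        if not 0.0 <= priority <= 1.0:
            raise ValueError("priority must be in [0, 1]")
        self.cells[(disease, scenario)] = priority

    def scenarios_for(self, disease: str) -> list[str]:
        """Return all scenarios linked to a disease, sorted by name."""
        return sorted(s for (d, s) in self.cells if d == disease)

    def high_priority(self, threshold: float) -> list[tuple[str, str]]:
        """Return cells at or above the threshold, highest priority first."""
        selected = [pair for pair, p in self.cells.items() if p >= threshold]
        return sorted(selected, key=lambda pair: (-self.cells[pair], pair))


# Illustrative data only: scenario labels and scores are made up.
matrix = AssociationMatrix()
matrix.link("E11", "chronic_disease_management", 0.9)   # type 2 diabetes mellitus
matrix.link("E11", "clinical_decision_support", 0.6)
matrix.link("C34", "imaging_assisted_diagnosis", 0.95)  # lung/bronchus neoplasm
matrix.link("I10", "chronic_disease_management", 0.7)   # essential hypertension

targets = matrix.high_priority(threshold=0.8)
```

Here `targets` lists the disease-scenario pairs that a task-driven collection effort might address first. A real matrix would presumably carry more dimensions per cell (modality, task type, data source) than a single score.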

Authors: Jianfeng Shen (沈剑峰); Ru Huang (黄茹); Dong Min (闵栋); Hui Che (车慧); Baoshan Li (李宝山); Lihong Liu (刘丽红); Zhi Zhang (张智); Jing Cheng (程京); Shan Wang (王杉)
(AI Project Team, Smart Hospital Branch, Chinese Society of Medical Equipment, Beijing 100082, China; Institute of Cloud Computing and Big Data, China Academy of Information and Communications Technology (CAICT), Beijing 100083, China; Information Center, Peking University People's Hospital, Beijing 100044, China; Institute of Big Data and Artificial Intelligence, National Engineering Research Center for Beijing Biochips Technology, Beijing 102206, China; School of Biomedical Engineering, Tsinghua University, Beijing 100084, China; Surgical Oncology Laboratory, Clinical Big Data Research Center, Peking University People's Hospital, Beijing 100044, China)

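Institutions: AI Project Team, Smart Hospital Branch, Chinese Society of Medical Equipment; Institute of Cloud Computing and Big Data, China Academy of Information and Communications Technology; Information Center, Peking University People's Hospital; Institute of Big Data and Artificial Intelligence, National Engineering Research Center for Beijing Biochips Technology; School of Biomedical Engineering, Tsinghua University; Clinical Big Data Research Center, Peking University People's Hospital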

Source: Chinese Science Bulletin (《科学通报》, PKU Core Journal), 2025, Issue 26, pp. 4560-4568 (9 pages)

Keywords: generative artificial intelligence; large language model; vertical large model; corpus; multimodal corpora; health industry

Classification: R-05 [Medicine and Health]; TP18 [Automation and Computer Technology: Control Theory and Control Engineering]

