Journal Articles
1,491 articles found
An alert-situation text data augmentation method based on MLM
1
Authors: DING Weijie, MAO Tingyun, CHEN Lili, ZHOU Mingwei, YUAN Ying, HU Wentao 《High Technology Letters》 EI CAS 2024, No.4, pp.389-396 (8 pages)
The performance of deep learning models is heavily reliant on the quality and quantity of training data. Insufficient training data will lead to overfitting. However, in the task of alert-situation text classification, it is usually difficult to obtain a large amount of training data. This paper proposes a text data augmentation method based on the masked language model (MLM), aiming to enhance the generalization capability of deep learning models by expanding the training data. The method employs a Mask strategy to randomly conceal words in the text, effectively leveraging contextual information to predict and replace the masked words with the MLM, thereby generating new training data. Three Mask strategies, at the character level, word level, and N-gram level, are designed, and the performance of each strategy under different Mask ratios is analyzed. The experimental results show that the word-level Mask strategy outperforms traditional data augmentation methods.
Keywords: deep learning; text data augmentation; masked language model (MLM); alert-situation text classification
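The Mask step of the method above can be sketched as follows. This is a minimal sketch: the prediction-and-replacement step with a pretrained MLM (e.g. BERT) is omitted, and the function names and the `[MASK]` token convention are illustrative assumptions, not the paper's code.

```python
import random

MASK = "[MASK]"

def word_level_mask(tokens, mask_ratio=0.15, rng=None):
    """Word-level strategy: randomly replace a fraction of word tokens
    with [MASK]. In the paper's pipeline a pretrained MLM would then
    predict replacements for the masked positions; that step is omitted."""
    rng = rng or random.Random(0)
    n_mask = max(1, int(len(tokens) * mask_ratio))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in positions else t for i, t in enumerate(tokens)]

def ngram_mask(tokens, n=2, rng=None):
    """N-gram strategy: mask one contiguous span of n tokens."""
    rng = rng or random.Random(0)
    start = rng.randrange(max(1, len(tokens) - n + 1))
    return [MASK if start <= i < start + n else t
            for i, t in enumerate(tokens)]
```

Per the paper's finding, the word-level variant (with a tuned mask ratio) would be the one to prefer.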
Quantitative Comparative Study of the Performance of Lossless Compression Methods Based on a Text Data Model
2
Authors: Namogo Silué, Sié Ouattara, Mouhamadou Dosso, Alain Clément 《Open Journal of Applied Sciences》 2024, No.7, pp.1944-1962 (19 pages)
Data compression plays a key role in optimizing the use of memory storage space and in reducing latency in data transmission. In this paper, we are interested in lossless compression techniques because their performance is exploited alongside lossy compression techniques for images and videos, generally in a mixed approach. To study the performance of lossless compression methods, we first carried out a literature review, from which we selected the most relevant techniques: arithmetic coding, LZW, Tunstall's algorithm, RLE, BWT, Huffman coding, and Shannon-Fano. Secondly, we designed a purposive text dataset with a repeating pattern in order to test the behavior and effectiveness of the selected compression techniques. Thirdly, we designed the compression algorithms and developed the programs (scripts) in Matlab in order to test their performance. Finally, the tests conducted on this deliberately modeled data show that the methods perform very satisfactorily, in the following order: LZW, arithmetic coding, Tunstall's algorithm, and BWT + RLE. Likewise, it appears that the performance of certain techniques relative to others is strongly linked, on the one hand, to the sequencing and/or recurrence of the symbols that make up the message and, on the other hand, to the cumulative encoding and decoding time.
Keywords: arithmetic coding; BWT; compression ratio; comparative study; compression techniques; Shannon-Fano; Huffman; lossless compression; LZW; performance; redundancy; RLE; text data; Tunstall
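Of the techniques compared above, LZW ranked first. A compact, dictionary-based sketch of it, in illustrative Python rather than the authors' Matlab scripts:

```python
def lzw_compress(data: str) -> list:
    """Textbook LZW: grow a dictionary of seen substrings and emit
    integer codes. Real implementations pack codes into variable-width
    bit fields; a list of ints is enough to show the idea."""
    dictionary = {chr(i): i for i in range(256)}
    w, out = "", []
    for c in data:
        wc = w + c
        if wc in dictionary:
            w = wc
        else:
            out.append(dictionary[w])
            dictionary[wc] = len(dictionary)
            w = c
    if w:
        out.append(dictionary[w])
    return out

def lzw_decompress(codes: list) -> str:
    dictionary = {i: chr(i) for i in range(256)}
    w = chr(codes[0])
    out = [w]
    for k in codes[1:]:
        # k may refer to the entry being built (the classic cScSc case)
        entry = dictionary[k] if k in dictionary else w + w[0]
        out.append(entry)
        dictionary[len(dictionary)] = w + entry[0]
        w = entry
    return "".join(out)
```

On repetitive input of the kind the authors constructed, the code stream is shorter than the input, which is exactly the regime where LZW shines.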
Clustering Text Data Streams (Cited by: 7)
3
Authors: 刘玉葆, 蔡嘉荣, 印鉴, 傅蔚慈 《Journal of Computer Science & Technology》 SCIE EI CSCD 2008, No.1, pp.112-128 (17 pages)
Clustering text data streams is an important issue in the data mining community and has a number of applications such as news group filtering, text crawling, document organization, and topic detection and tracing. However, most methods are similarity-based approaches that only use the TF-IDF scheme to represent the semantics of text data, which often leads to poor clustering quality. Recently, researchers have argued that the semantic smoothing model is more effective than the existing TF-IDF scheme for improving text clustering quality. However, the existing semantic smoothing model is not suitable for a dynamic text data context. In this paper, we first extend the semantic smoothing model to the text data stream context. Based on the extended model, we then present two online clustering algorithms, OCTS and OCTSM, for the clustering of massive text data streams. In both algorithms, we also present a new cluster statistics structure named the cluster profile, which can capture the semantics of text data streams dynamically and at the same time speed up the clustering process. Efficient implementations of our algorithms are also given. Finally, we present a series of experimental results illustrating the effectiveness of our technique.
Keywords: clustering; database applications; data mining; text data streams
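The cluster-profile idea can be illustrated with a one-pass clustering sketch. This toy uses raw term frequencies rather than the paper's semantic smoothing model, so the similarity threshold and function names are assumptions:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def online_cluster(docs, threshold=0.3):
    """One-pass stream clustering: each cluster keeps a running
    term-frequency profile (a crude stand-in for the paper's semantic
    cluster profile). Assign each arriving document to the most similar
    cluster, or open a new one if nothing clears the threshold."""
    profiles, assignment = [], []
    for doc in docs:
        vec = Counter(doc.lower().split())
        best, best_sim = -1, threshold
        for i, prof in enumerate(profiles):
            sim = cosine(vec, prof)
            if sim > best_sim:
                best, best_sim = i, sim
        if best == -1:
            profiles.append(vec)
            assignment.append(len(profiles) - 1)
        else:
            profiles[best] += vec      # update the cluster profile online
            assignment.append(best)
    return assignment
```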
Emotional and semantic analysis of landscape elements in heritage parks:insights from social media data on visitor perception
4
Authors: SiYi Ren, Xiaolong Chen 《Built Heritage》 2025, No.3, pp.60-79 (20 pages)
Heritage parks preserve cultural heritage while contributing to education, recreation, and sustainable urban development. However, the relationships between landscape elements and visitors' emotional responses remain underexplored. This study aims to bridge this gap by analysing social media reviews to understand how various landscape elements in heritage parks influence emotional experiences, offering insights for enhancing park services and promoting practical cultural heritage preservation. Using data from 63,288 visitor reviews (2018-2023) across five major social media platforms, this research focuses on five heritage parks in Xi'an as case studies. By employing latent Dirichlet allocation (LDA) topic modelling, sentiment analysis, and social network analysis (SNA), we identified five key categories of landscape elements: cultural landscapes, biological and natural environments, framework and service facilities, special activities, and architecture and infrastructure. Sentiment analysis revealed that cultural landscapes elicited the highest positive emotional response score (64.9%), reflecting their historical and aesthetic significance. Framework and service facilities had the highest emotional intensity score (8.09), emphasising their functional role in enhancing visitor satisfaction. In contrast, the biological and natural environments presented weaker emotional appeal (6.86). This study provides a novel framework linking emotional responses to specific landscape features, offering practical guidance for optimising heritage park management and supporting the preservation and promotion of cultural heritage.
Keywords: heritage tourism; heritage park; emotional analysis; social media text data; landscape elements
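The per-category positive-response scores reported above can be illustrated with a minimal lexicon-based sketch. The lexicon and function name are hypothetical; the study's sentiment analysis is considerably more sophisticated:

```python
# Hypothetical tiny sentiment lexicon, for illustration only
POS = {"beautiful", "impressive", "peaceful", "excellent", "love"}
NEG = {"crowded", "boring", "expensive", "dirty", "disappointing"}

def positive_response_rate(reviews):
    """Share of reviews whose positive words outnumber negative ones,
    mirroring the kind of per-category positive-response score the
    study reports for each landscape element."""
    pos = 0
    for r in reviews:
        words = set(r.lower().split())
        if len(words & POS) > len(words & NEG):
            pos += 1
    return pos / len(reviews) if reviews else 0.0
```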
Identifying Scientific Project-generated Data Citation from Full-text Articles: An Investigation of TCGA Data Citation (Cited by: 4)
5
Authors: Jiao Li, Si Zheng, Hongyu Kang, Zhen Hou, Qing Qian 《Journal of Data and Information Science》 2016, No.2, pp.32-44 (13 pages)
Purpose: In the open science era, it is typical to share project-generated scientific data by depositing it in an open and accessible database. Moreover, scientific publications are preserved in a digital library archive. It is challenging to identify the data usage that is mentioned in literature and associate it with its source. Here, we investigated the data usage of a government-funded cancer genomics project, The Cancer Genome Atlas (TCGA), via a full-text literature analysis. Design/methodology/approach: We focused on identifying articles using the TCGA dataset and constructing linkages between the articles and the specific TCGA dataset. First, we collected 5,372 TCGA-related articles from PubMed Central (PMC). Second, we constructed a benchmark set of 25 full-text articles that truly used TCGA data in their studies, and we summarized the key features of the benchmark set. Third, the key features were applied to the remaining PMC full-text articles. Findings: The number of publications using TCGA data has increased significantly since 2011, although the TCGA project was launched in 2005. Additionally, we found that the critical areas of focus in the studies using TCGA data were glioblastoma multiforme, lung cancer, and breast cancer; meanwhile, data from the RNA-sequencing (RNA-seq) platform is the most preferred. Research limitations: The current workflow to identify articles that truly used TCGA data is labor-intensive; an automatic method is expected to improve performance. Practical implications: This study will help cancer genomics researchers follow the latest advancements in cancer molecular therapy, and it will promote data sharing and data-intensive scientific discovery. Originality/value: Few studies have investigated data usage by government-funded projects/programs since their launch. In this preliminary study, we extracted articles that use TCGA data from PMC, and we created a link between the full-text articles and the source data.
Keywords: scientific data; full-text literature; open access; PubMed Central; data citation
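A rule-based matcher in the spirit of the benchmark-derived features might look like the sketch below. The patterns are illustrative assumptions, not the paper's actual feature set:

```python
import re

# Illustrative usage patterns; the paper derives its features manually
# from 25 hand-checked full-text articles.
USAGE_PATTERNS = [
    r"\bTCGA\b.{0,60}\b(data|dataset|cohort|samples?)\b",
    r"\bThe Cancer Genome Atlas\b",
    r"\bdownloaded from\b.{0,40}\bTCGA\b",
]

def mentions_tcga_usage(text: str) -> bool:
    """Flag sentences that look like genuine TCGA data usage rather
    than a passing citation of the project."""
    return any(re.search(p, text, re.IGNORECASE) for p in USAGE_PATTERNS)
```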
Automatic User Goals Identification Based on Anchor Text and Click-Through Data (Cited by: 6)
6
Authors: YUAN Xiaojie, DOU Zhicheng, ZHANG Lu, LIU Fang 《Wuhan University Journal of Natural Sciences》 CAS 2008, No.4, pp.495-500 (6 pages)
Understanding the underlying goal behind a user's Web query has been proved helpful for improving the quality of search. This paper focuses on the problem of automatically identifying query types according to these goals. Four novel entropy-based features extracted from anchor data and click-through data are proposed, and a support vector machine (SVM) classifier is used to identify the user's goal based on these features. Experimental results show that the proposed entropy-based features are more effective than those reported in previous work. By combining multiple features, the goals of more than 97% of the queries studied can be correctly identified. Beyond this, the paper reaches the following conclusions: first, anchor-based features are more effective than click-through-based features; second, the number of sites is more reliable than the number of links; third, click-distribution-based features are more effective than session-based ones.
Keywords: query classification; user goals; anchor text; click-through data; information retrieval
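One family of entropy-based features can be sketched directly: the entropy of the click distribution for a query. The exact feature design here is an assumption in the spirit of the paper:

```python
import math
from collections import Counter

def click_entropy(clicked_urls):
    """Shannon entropy of the click distribution for one query.
    Low entropy (clicks concentrated on one site) suggests a
    navigational goal; high entropy suggests an informational one."""
    counts = Counter(clicked_urls)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values())
```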
A Complexity Analysis and Entropy for Different Data Compression Algorithms on Text Files (Cited by: 1)
7
Authors: Mohammad Hjouj Btoush, Ziad E. Dawahdeh 《Journal of Computer and Communications》 2018, No.1, pp.301-315 (15 pages)
In this paper, we analyze the complexity and entropy of different data compression algorithms: LZW, Huffman, fixed-length code (FLC), and Huffman after using fixed-length code (HFLC). We test these algorithms on files of different sizes and conclude that LZW is the best across all compression scales we tested, especially on large files, followed by Huffman, HFLC, and FLC, respectively. Data compression is still an important research topic with many applications. We therefore suggest continuing research in this field, for example by combining two techniques to reach a better one, or by using another source mapping (Hamming), such as embedding a linear array into a hypercube, together with proven techniques like Huffman.
Keywords: text files; data compression; Huffman coding; LZW; Hamming; entropy; complexity
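Huffman coding, one of the methods analyzed above, can be sketched with a heap; per-symbol code lengths are enough to compute the compressed size and compare against the file's entropy. This is illustrative Python, not the authors' implementation:

```python
import heapq
from collections import Counter

def huffman_code_lengths(text: str) -> dict:
    """Build a Huffman tree with a min-heap and return the code length
    of each symbol (sum of length * frequency gives the coded size)."""
    freq = Counter(text)
    if len(freq) == 1:                  # degenerate single-symbol input
        return {next(iter(freq)): 1}
    # heap items: (frequency, tiebreak id, {symbol: depth})
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in d1.items()}
        merged.update({s: d + 1 for s, d in d2.items()})
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]
```

Frequent symbols get short codes, which is why Huffman beats the fixed-length code (FLC) on skewed distributions.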
A feature representation method for biomedical scientific data based on composite text description
8
Authors: SUN Wei 《Chinese Journal of Library and Information Science》 2009, No.4, pp.43-53 (11 pages)
Feature representation is one of the key issues in data clustering. The existing feature representation of scientific data is not sufficient, which to some extent affects the results of scientific data clustering. Therefore, this paper proposes the concept of a composite text description (CTD) and a CTD-based feature representation method for biomedical scientific data. The method uses different feature weighting algorithms to represent candidate features based on two types of data sources, then combines and strengthens the two feature sets. Experiments show that the feature representation method is more effective than traditional methods and can significantly improve the performance of biomedical data clustering.
Keywords: composite text description; scientific data; feature representation; weighting algorithm
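One standard weighting scheme of the kind combined in a composite representation is TF-IDF; a minimal sketch follows (the paper's exact weighting algorithms are not reproduced here):

```python
import math
from collections import Counter

def tfidf(docs):
    """Classic TF-IDF weights per document: term frequency scaled by
    a smoothed inverse document frequency, so terms common to every
    document are down-weighted."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))            # document frequency per term
    weights = []
    for toks in tokenized:
        tf = Counter(toks)
        weights.append({t: (c / len(toks)) * math.log((1 + n) / (1 + df[t]))
                        for t, c in tf.items()})
    return weights
```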
Fast Data Processing of a Polarimeter-Interferometer System on J-TEXT
9
Authors: 刘煜锴, 高丽, 刘海庆, 杨曜, 高翔, J-TEXT Team 《Plasma Science and Technology》 SCIE EI CAS CSCD 2016, No.12, pp.1143-1147 (5 pages)
A method of fast data processing has been developed to rapidly obtain the evolution of the electron density profile for the multichannel polarimeter-interferometer system (POLARIS) on J-TEXT. Compared with the Abel inversion method, the evolution of the density profile analyzed by this method can quickly offer important information. The method has the advantage of fast calculation speed, on the order of ten milliseconds per normal shot, and is capable of processing data sampled at up to 1 MHz, which is helpful for studying density sawtooth instability and disruptions between shots. During the flat-top plasma current of usual ohmic discharges on J-TEXT, the shape factor u ranges from 4 to 5. When a disruption happens, the density profile becomes peaked and the shape factor u typically decreases to 1.
Keywords: fast data processing; polarimeter-interferometer; J-TEXT
Hidden-Hazard Text Classification and Type Feature Analysis Based on SVM and a Normalized Entropy Model
10
Authors: 乔剑锋, 刘萱, 艾莉莎, 张丽玮, 王汀 《重庆大学学报》 (PKU Core) 2026, No.2, pp.105-115 (11 pages)
To improve the efficiency of organizing and retrieving hidden-hazard information and to support more complex information-processing tasks, effective techniques are needed to classify the data automatically and analyze its types. A support vector machine (SVM) can classify free text automatically, but because the algorithm works by finding the optimal decision boundary in the training set, it cannot uncover the typical features of each type. To analyze the features shared by samples of a type, a normalized entropy model is proposed for finding type-typical features, improving on the current term frequency-inverse document frequency (TF-IDF) approach to type feature identification. Taking 2,534 law-enforcement inspection records from a municipal emergency management bureau as an example, SVM classification reaches an accuracy of 97%. Meanwhile, the normalized entropy model yields the typical features of each type, providing decision support for targeted hidden-hazard inspection campaigns. The experimental results show that combining SVM with the normalized entropy model can efficiently solve the joint problem of text classification and type feature identification.
Keywords: text mining; data mining; hidden-hazard inspection; support vector machine
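The normalized-entropy idea can be sketched as follows: score each term by how evenly it spreads across classes, treating low-entropy terms as type-typical. This is a sketch of the idea, not the paper's exact formulation:

```python
import math
from collections import Counter, defaultdict

def normalized_entropy(labeled_docs):
    """Normalized entropy of each term over class labels: a term whose
    occurrences concentrate in one class scores near 0 (typical of that
    class); a term spread evenly across classes scores near 1."""
    term_class = defaultdict(Counter)
    for label, text in labeled_docs:
        for t in set(text.lower().split()):
            term_class[t][label] += 1
    k = len({label for label, _ in labeled_docs})
    scores = {}
    for t, counts in term_class.items():
        total = sum(counts.values())
        h = -sum((c / total) * math.log(c / total)
                 for c in counts.values())
        scores[t] = h / math.log(k)     # normalize into [0, 1]
    return scores
```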
Research on a Coal-Mine Safety Risk Prediction Model Fusing Category Descriptions and Enhanced Embeddings
11
Authors: 杨超宇, 黄大卫 《安全与环境学报》 (PKU Core) 2026, No.2, pp.517-528 (12 pages)
Coal-mine safety risk identification texts contain rich descriptions of risk features and expert knowledge, and mining them is valuable for predicting risk levels. To address the small-sample, short-text, and semantically complex nature of such texts, a coal-mine safety risk prediction model fusing category descriptions and enhanced embeddings is proposed. The method augments the data at the sentence-embedding level, effectively enlarging the training set; it introduces coal-mining domain knowledge by constructing risk category descriptions and fuses them dynamically via an attention mechanism, supplementing the risk samples with expert knowledge; a Bidirectional Long Short-Term Memory (Bi-LSTM) network and the Mamba algorithm extract deep features from the original text, capturing the core features of complex coal-mine contexts; finally, a dynamic gating mechanism fuses the features of each module to produce the prediction. Experiments show that the model achieves good accuracy and F1 scores on a small coal-mine risk identification dataset and can support coal-mine safety risk-level prediction from risk identification texts.
Keywords: safety engineering; risk prediction; coal-mine safety; few-shot; short text; data augmentation; feature fusion
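The dynamic gating step can be illustrated with a dependency-free toy. The gate form, weights, and dimensions are assumptions; the paper's module operates on learned Bi-LSTM/Mamba features:

```python
import math

def gated_fusion(a, b, w, bias=0.0):
    """Dynamic gate over two feature vectors: g = sigmoid(w . [a;b] + bias),
    output = g*a + (1-g)*b elementwise. The gate decides, per input,
    how much each feature stream contributes."""
    z = sum(wi * xi for wi, xi in zip(w, a + b)) + bias
    g = 1.0 / (1.0 + math.exp(-z))
    return [g * ai + (1.0 - g) * bi for ai, bi in zip(a, b)]
```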
EOS Data Dumper: An Automatic Download and Re-publication System for Free EOS Data (Cited by: 5)
12
Authors: 南卓铜, 王亮绪, 李新 《冰川冻土》 CSCD (PKU Core) 2007, No.3, pp.463-469 (7 pages)
To make better use of existing data resources and avoid duplicated investment in research facilities, data sharing is receiving growing attention. NASA's Earth Observing System (EOS) provides a large amount of free data, including MODIS. EOS Data Dumper (EDD) programmatically emulates the normal download workflow of the EOS data portal and applies Web page text capture techniques to automatically download, on schedule, all free EOS data for a study area; it then re-publishes the data on the Internet through the free DIAL system, enabling complex spatio-temporal queries. This paper details, from a technical perspective, EDD's background, significance, and implementation.
Keywords: EOS data; remote sensing image data; text information capture; data sharing
Design and Implementation of an LLM-Driven Dialogue System for Library Business Data
13
Authors: 张光照, 王忠义, 王楠, 张银玲, 杨帆 《图书馆论坛》 (PKU Core) 2026, No.3, pp.124-134 (11 pages)
This paper explores fine-tuning large language models so that requirements described in natural language can be turned into SQL queries and reports over a library's business database, addressing the rigidity of existing query functions, which cannot keep up with diverse real-world needs. Based on the reporting and decision needs of library departments, a library-domain Text-to-SQL training and test set was built around the characteristics of the library business database and the Chinese Library Classification. Typical open-source medium-scale LLMs (6-7B parameters) were fine-tuned with LoRA on the training set, then validated for SQL-generation effectiveness on the test set; FastChat was used to deploy the model, and a visual interactive system was developed and evaluated in real scenarios. The fine-tuned Chatglm3-6B model reached an execution accuracy of 0.9426, and comparison with the very large model Qwen-Max in real interactive scenarios demonstrates the effectiveness of Text-to-SQL fine-tuning of medium-scale LLMs for natural-language querying of library business data.
Keywords: library business data; Text-to-SQL; large language models; database dialogue system; fine-tuning
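The execution-accuracy metric reported above (0.9426 for the fine-tuned Chatglm3-6B) can be sketched with sqlite3: a prediction counts as correct if it returns the same result set as the gold query on the same database. The schema and queries below are illustrative, not the library's real tables:

```python
import sqlite3

def execution_accuracy(pairs, schema_sql, data_sql):
    """Score (predicted, gold) SQL pairs by executing both against an
    in-memory database and comparing result sets; failed predictions
    count as misses."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(schema_sql + data_sql)
    hits = 0
    for predicted, gold in pairs:
        try:
            p = sorted(conn.execute(predicted).fetchall())
        except sqlite3.Error:
            p = None                    # invalid SQL counts as a miss
        g = sorted(conn.execute(gold).fetchall())
        if p == g:
            hits += 1
    return hits / len(pairs)
```

Note that execution accuracy credits any predicted query with the right result, even if its text differs from the gold SQL.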
Semi-Supervised Learning in Large Scale Text Categorization (Cited by: 1)
14
Authors: 许泽文, 李建强, 刘博, 毕敬, 李蓉, 毛睿 《Journal of Shanghai Jiaotong University (Science)》 EI 2017, No.3, pp.291-302 (12 pages)
The rapid development of the Internet brings a variety of original information, including text, audio, and more. However, it is difficult to find the most useful knowledge rapidly and accurately because of its huge volume. Automatic text classification technology based on machine learning can classify a large number of natural-language documents into the corresponding subject categories according to their semantics, helping users grasp text information directly. By learning from a set of hand-labeled documents, we obtain a traditional supervised classifier for text categorization (TC). However, labeling all data by hand is labor-intensive and time-consuming. To solve this problem, some scholars have proposed semi-supervised learning methods to train classifiers, but these are unfeasible for the great variety and number of Web data since they still need a portion of hand-labeled data. In 2012, Li et al. invented a fully automatic categorization approach for text (FACT) based on supervised learning, where no manual labeling efforts are required. But automatically labeling all data can introduce noise and leave the results short of the accuracy requirement. We put forward a new idea: part of the data can be automatically tagged with high accuracy based on the semantics of the category name, and a semi-supervised approach then trains the classifier with both labeled and unlabeled data, ultimately achieving precise classification of massive text data. Empirical experiments show that the method outperforms the supervised support vector machine (SVM) in terms of both F1 performance and classification accuracy in most cases, proving the effectiveness of the semi-supervised algorithm in automatic TC.
Keywords: text data mining; semi-supervised; automatic tagging; classifier
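The seed-then-propagate idea can be illustrated with a toy self-training loop. The paper trains an SVM; the keyword seeding and word-overlap rule here are simplifying assumptions:

```python
def self_train(unlabeled, category_seeds, rounds=2):
    """Bootstrap labels: documents containing a category-name keyword
    are auto-tagged first (the high-accuracy seed set), then remaining
    documents adopt the label of the most word-overlapping tagged
    document over a few rounds."""
    labels = {}
    for i, doc in enumerate(unlabeled):
        words = set(doc.lower().split())
        for cat, seeds in category_seeds.items():
            if words & seeds:
                labels[i] = cat
                break
    for _ in range(rounds):
        for i, doc in enumerate(unlabeled):
            if i in labels:
                continue
            words = set(doc.lower().split())
            best = max(labels,
                       key=lambda j: len(words
                                         & set(unlabeled[j].lower().split())),
                       default=None)
            if best is not None and words & set(unlabeled[best].lower().split()):
                labels[i] = labels[best]
    return labels
```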
A Chinese Medical Short-Text Classification Model Fusing Multi-Level Semantics
15
Authors: 杨杰, 刘纳, 郑国风, 李晨, 道路 《郑州大学学报(理学版)》 (PKU Core) 2026, No.1, pp.51-57 (7 pages)
To address the insufficient extraction of key semantic information and reduced robustness in medical short-text classification, a text classification model fusing multi-level semantic information is proposed. First, a pre-trained model captures preliminary semantic features of the text. Second, a capsule network extracts key semantic information so that the model effectively learns the core semantics of short texts, while attention pooling focuses on document-level information, strengthening the recognition and understanding of medical terminology and concepts. Finally, an adversarial training strategy improves the model's stability and accuracy when facing ambiguous wording or perturbed inputs. The model is validated on three medical text classification datasets (CHIP-CTC, KUAKE_QIC, and VSQ); compared with existing models, it improves the F1 score on all three datasets, significantly enhancing Chinese medical short-text classification performance.
Keywords: Chinese medical data; short-text classification; semantic fusion; capsule network; attention pooling
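Attention pooling, one of the components above, can be sketched without any framework. The dimensions and the query vector are illustrative; in the model the query is learned:

```python
import math

def attention_pool(token_vectors, query):
    """Attention pooling: score each token vector against a query
    vector, softmax the scores, and return the weighted sum - a
    document-level vector that emphasizes high-scoring tokens."""
    scores = [sum(q * t for q, t in zip(query, vec))
              for vec in token_vectors]
    m = max(scores)                      # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(token_vectors[0])
    return [sum(w * vec[d] for w, vec in zip(weights, token_vectors))
            for d in range(dim)]
```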
Copyright Open-Licensing Rules in Data Training and Their Implementation Path
16
Authors: 李倩, 沈立苏 《信息安全研究》 (PKU Core) 2026, No.1, pp.68-74 (7 pages)
Generative AI training depends on massive numbers of works and thus raises copyright infringement risks; jurisdictions such as the EU, the US, and Japan regulate this through rules such as text and data mining exceptions. Although allowing works to be used for data training has largely become a domestic theoretical consensus, the concrete compliance path remains contested. This study finds that a copyright open-licensing mechanism should be introduced into data training, replacing work-by-work authorization with self-declarations, encouraging rights holders to participate through reasonable benefit allocation and transparent oversight, and building a dynamic balance between rights protection and technological innovation. Given that works are protected automatically and are vast in number, the public-notice effect of open-licensing declarations should be clarified, the reliance interests of good-faith third parties should be protected, and rights holders should be allowed to license series of works collectively, better fitting the data-intensive uses of the intelligent era.
Keywords: data training; text and data mining; copyright; open licensing; public-notice effect
A Multimodal Data Fusion Method Based on a Text-Matched Emotion Prototype Pool and a Cross-Modal Shared Dimension Space
17
Authors: 黄竞泽, 诸佳炜, 王瑞 《工业控制计算机》 2026, No.2, pp.48-50 (3 pages)
With the wide spread of social media and mobile devices, multimodal data such as text, images, and audio are heavily used to express emotion, posing new challenges for multimodal emotion recognition. Traditional fusion methods suffer from under-exploited cross-modal interactions and distribution gaps between representations. This paper proposes a multimodal data fusion method based on a text-matched emotion prototype pool and a shared dimension space: it dynamically generates an emotion prototype pool and combines it with a cross-modal shared dimension space mechanism to prune non-essential features and capture complex inter-modal correlations. Experiments show that the model outperforms existing methods in emotion classification accuracy, F1, MAE, and Corr on the MOSI and MOSEI datasets. Ablation studies further verify that the text-prototype-guided matching pool and the cross-modal shared dimension space are key to the gains, offering a novel solution for multimodal emotion recognition.
Keywords: multimodal data fusion; text-dominant; multimodal emotion recognition
A Multimodal Adaptive Entity Recognition Method Based on Reinforcement Policy Feedback
18
Authors: 焦明海, 樊本航, 王静, 彭玉怀 《计算机研究与发展》 (PKU Core) 2026, No.2, pp.294-304 (11 pages)
The core goal of named entity recognition (NER) is to identify entities and their semantic types in unstructured text. With the rapid growth of social media, text often appears together with visual information, forming multimodal content. To improve recognition accuracy, multimodal NER (MNER) exploits semantic information from different modalities for complementarity and deep fusion. However, representation gaps between modalities can introduce visual noise that interferes with entity recognition, and ambiguous entity references or vague context in the text add further difficulty. To address these problems, an MNER method based on reinforcement policy feedback and an adaptive loss mechanism is proposed. First, the method uses a three-stage chain-of-thought (COT) reasoning pipeline based on GPT-4o to form a progressive reasoning framework; combined with an adaptive feedback mechanism from reinforcement learning, it scores image-text matching, and an adaptive decision function effectively filters out visual noise. Second, four task-specific loss functions are designed and optimized with an adaptive weighted fusion strategy to mitigate the recognition uncertainty caused by ambiguous context. Experiments on the public Twitter-2015 and Twitter-2017 datasets show overall F1 scores of 86.45% and 93.80% respectively, significantly outperforming current mainstream baseline models.
Keywords: named entity recognition; unstructured text; multimodal data; reinforcement learning; adaptive loss; chain of thought
A Comparative Study on Two Techniques of Reducing the Dimension of Text Feature Space
19
Authors: Yin Zhonghang, Wang Yongcheng, Cai Wei, Diao Qian (School of Electronic & Information Technology, Shanghai Jiaotong University, Shanghai 200030, P.R. China) 《Journal of Systems Engineering and Electronics》 SCIE EI CSCD 2002, No.1, pp.87-92 (6 pages)
With the development of large-scale text processing, the dimension of the text feature space has become larger and larger, which has added many difficulties to natural language processing. How to reduce the dimension has become a practical problem in the field. Here we present two clustering methods, concept association and concept abstraction, to achieve this goal. The first refers to keyword clustering based on co-occurrence in the same text, and the second refers to co-occurrence in the same category. We then compare the difference between them. Our experimental results show that both are efficient at reducing the dimension of the text feature space.
Keywords: text data mining
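The concept-association signal, co-occurrence of keywords in the same text, can be counted directly; a sketch follows (the clustering step built on these counts is left out):

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_counts(documents):
    """Count how often each keyword pair appears in the same text -
    the raw signal behind concept-association clustering. Pairs are
    stored with alphabetically sorted keys so (a, b) == (b, a)."""
    counts = defaultdict(int)
    for doc in documents:
        words = sorted(set(doc.lower().split()))
        for a, b in combinations(words, 2):
            counts[(a, b)] += 1
    return dict(counts)
```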
The Status and Measurement of Digital Transformation in Heilongjiang's Manufacturing Industry: Text Data Mining of Listed Companies' Annual Reports
20
Authors: 张彩云, 孙军 《中国商论》 2026, No.4, pp.156-160 (5 pages)
Against the backdrop of an accelerating global digital economy, the digital transformation of manufacturing has become an important route to high-quality regional development. Taking listed manufacturing companies in Heilongjiang as the research object and using annual-report texts from 2017-2023, this paper builds a three-dimensional measurement system of "core digital technology, digitalized production, and business model transformation" and applies text mining and word-frequency analysis to systematically assess their level of digital transformation. The results show that digital transformation in Heilongjiang's manufacturing industry is on a steady upward trend overall, but with clear regional and industry heterogeneity: Harbin leads in the application of several digital technologies while most regions remain at an early stage, and companies commonly face data silos, weak technology, and talent shortages. Based on these findings, targeted recommendations are proposed to foster a multi-point, coordinated pattern of digital manufacturing.
Keywords: manufacturing; digital transformation; text data mining; word-frequency analysis; artificial intelligence; cloud computing; Heilongjiang
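The word-frequency measurement can be sketched as lexicon hits per dimension. The keyword dictionary below is hypothetical, not the study's actual lexicon:

```python
from collections import Counter

# Hypothetical keyword lexicon for the three measurement dimensions
DIMENSIONS = {
    "core_digital_tech": {"ai", "cloud", "blockchain", "big-data"},
    "digital_production": {"automation", "smart-factory", "iot"},
    "business_model": {"e-commerce", "platform", "online-service"},
}

def transformation_scores(report_text):
    """Word-frequency score per dimension: count lexicon hits in an
    annual-report text, as in term-frequency-based measurement."""
    counts = Counter(report_text.lower().split())
    return {dim: sum(counts[w] for w in words)
            for dim, words in DIMENSIONS.items()}
```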