Journal Articles
25 articles found (page 1 of 2, 20 per page)
1. A Survey of Face Age Editing Based on Generative Adversarial Networks and Diffusion Models (cited: 1)
Authors: 金家立, 高思远, 高满达, 王文彬, 柳绍祯, 孙哲南. 《数据与计算发展前沿(中英文)》, 2025(1): 38-55 (18 pages)
Abstract: [Objective] In recent years, deep generative models have made remarkable progress on face age editing. This paper surveys face age editing methods based on deep generative models such as generative adversarial networks (GANs) and diffusion models. [Methods] We first introduce the basic concepts of face age editing, the related datasets, and the evaluation metrics; we then analyze how commonly used GANs, diffusion models, and their variants are applied to age editing, summarize the performance of existing models in terms of age accuracy, identity consistency, and generated image quality, and discuss the applicability of the different evaluation metrics. [Results] Age editing techniques based on GANs and diffusion models have achieved significant progress in generated image quality and age-prediction accuracy, but the generation of facial details remains inadequate across large age gaps. [Conclusions] Future research on face age editing could further improve generative capability and practical performance by building larger, higher-quality datasets and by combining 3D face reconstruction with the efficient sampling algorithms of diffusion models.
Keywords: deep learning; generative adversarial network; diffusion model; attribute editing; face age editing
2. Error Modeling and Compensation for a Heavy-duty Double-swing Worktable
Authors: 李锁庄, 包瑞, 李硕, 贾琨, 叶范亭, 刘漫贤. 《制造技术与机床》 (北大核心), 2025(11): 156-165 (10 pages)
Abstract: To address the difficulty of efficient, precise control of a space-constrained heavy-duty double-swing worktable caused by coupling errors, a coupling-error modeling and compensation method based on the rigid-body motion assumption is proposed. By analyzing the machining process and assembly relationships, the formation mechanism of the pitch- and roll-angle coupling errors is examined, the key geometric-error parameters affecting pose-control accuracy are identified, and the kinematic model is corrected. A coupling-error model is then established from the mechanism relationships, the distribution characteristics of independent and coupling errors are analyzed through numerical simulation, and a pose-error compensation strategy and method are proposed. After error-parameter identification, the effectiveness of the compensation model is further verified using a uniform experimental design. Experimental results show that the compensation method based on the geometric-error model significantly improves pose-control accuracy and efficiency, laying a theoretical foundation for the design and manufacture of heavy-duty double-swing worktables under space constraints.
Keywords: space-constrained; heavy-duty; double-swing worktable; coupling error; error compensation
3. Semi-automatic 3D Point-cloud Classification by Probabilistic Mixing of Local Shape Features (cited: 7)
Authors: 李红军, 刘欣莹, 张晓鹏, 严冬明. 《浙江大学学报(理学版)》 (CAS, CSCD, 北大核心), 2017(1): 1-9 (9 pages)
Abstract: Point-cloud data acquired by 3D laser scanning is used in digital city construction, 3D model acquisition, scene analysis, object measurement, and related fields. However, occlusion, noise, complex scanned scenes, and limited sampling precision prevent classical surface and 3D-space theory from being applied directly to analyze and process such data effectively. Classification is one of the important preprocessing steps for point clouds. Four local shape features are extracted: neighborhood tetrahedron volume, neighborhood normal-vector divergence, principal-direction divergence, and principal curvature. A probabilistic mixing strategy is then used to build a semi-automatic point-cloud classification method that effectively separates planar, cylindrical, and other point sets. The strategy estimates, from the average neighbor distance and the single-feature class consistency, the probability that each feature infers a given shape, and infers the local shape by weighted mixing under a maximum-probability-weight criterion. User interaction is supported so that point clouds of different scanning scales and precisions can be handled. Experiments on simulated point clouds, single-tree point clouds, street scenes, open natural scenes, and airborne laser scans show that the probabilistic mixing method based on local shape features classifies all of these point-cloud types well.
Keywords: point-cloud classification; local shape inference; probabilistic mixing; normal-vector divergence; principal curvature; large-scale scene analysis
4. A Survey of 3D Face Imaging and Reconstruction (cited: 1)
Authors: 刘菲, 张堃博, 杨青, 周树波, 王云龙, 孙哲南. 《中国图象图形学报》 (CSCD, 北大核心), 2024(9): 2441-2470 (30 pages)
Abstract: Thanks to the rapid development of new 3D visual measurement techniques and deep learning models, 3D vision has become a key supporting technology for artificial intelligence, virtual reality, and related fields, and 3D face imaging and reconstruction have made breakthrough progress. These techniques cope better with variations in illumination, occlusion, expression, and pose, increase the difficulty of forgery attacks, greatly advance the reconstruction and rendering of photorealistic "virtual digital humans", and effectively improve the security of face systems. This paper comprehensively surveys 3D face imaging techniques and reconstruction models, with a systematic, in-depth analysis of deep-learning-based 3D face reconstruction. First, 3D face imaging devices and acquisition systems are reviewed and compared in detail, including face imaging systems based on new sensing technologies. Then, deep-learning-based 3D face reconstruction models are analyzed systematically and, by input data source, divided into four categories: monocular-image-based, multi-view-image-based, video-based, and speech-based. Finally, the current state of 3D face imaging research and its difficulties and challenges are summarized, and future directions and applications are discussed. Covering classic work on 3D face imaging and reconstruction from the past five years, the paper serves as a useful reference for face research, development, and applications.
Keywords: 3D face imaging; 3D face reconstruction; deep learning (DL); generative adversarial network (GAN); implicit neural representation (INR)
5. Entity Linking with Multiple Fused Features
Authors: 陈玉博, 何世柱, 刘康, 赵军, 吕学强. 《中文信息学报》 (CSCD, 北大核心), 2016(4): 176-183 (8 pages)
Abstract: Entity disambiguation is an important topic in natural language understanding. It aims to resolve the ambiguity of named entities that pervades text and has broad application value in information extraction, knowledge engineering, and the semantic web. Entity linking is an important approach to disambiguation: it links an ambiguous entity mention to a given knowledge base, thereby eliminating the ambiguity [1]. Traditional entity-linking methods mainly use surface features such as contextual word matching and lack deep semantic information. To address this, this paper proposes an entity-linking method that exploits multiple features to capture semantic information along different dimensions. To better fuse the features from each dimension, a learning-to-rank framework is used; compared with traditional methods it saves the manual selection and tuning of many model parameters, and compared with classification-based methods it better exploits the relational information among candidates. Experiments on the TAC-KBP-2009 entity-linking evaluation data show that the proposed features and method perform well, reaching 84.38% on the evaluation metric, 2.21% above the best participating team.
Keywords: entity disambiguation; entity linking; learning to rank
6. A Survey of Deep-learning-based SAR Image Denoising (cited: 4)
Authors: 雷钰, 刘帅奇, 张璐瑶, 刘彤, 赵杰. 《兵器装备工程学报》 (CAS, CSCD, 北大核心), 2022(11): 71-80 (10 pages)
Abstract: Traditional synthetic aperture radar (SAR) image denoising algorithms are limited in detail preservation and running time, whereas deep learning methods have distinct advantages. By summarizing the domestic and international literature, this paper analyzes the theoretical basis, strengths, and weaknesses of deep-learning-based SAR image denoising algorithms and describes the concrete implementation details of the network models. The denoising algorithms are categorized into supervised and self-supervised models. The training and testing procedures are described, including the training and test data, the loss functions commonly used in training, and the evaluation metrics for simulated and real test data. Future research directions for SAR image speckle suppression are discussed.
Keywords: SAR image; speckle suppression; deep learning; convolutional neural network; image denoising
7. Computing Centroidal Voronoi Tessellations via Local Optimization (cited: 1)
Authors: 叶畋宇, 王逸群, 严冬明, 雍俊海. 《系统仿真学报》 (CAS, CSCD, 北大核心), 2019(2): 218-226 (9 pages)
Abstract: The centroidal Voronoi tessellation (CVT) is an important geometric structure with significant applications in geographic information systems, signal processing, mesh generation and optimization, and visualization. To address shortcomings of traditional global generation and optimization methods, such as numerous singular points and slow convergence, two local optimization methods, generation-based optimization and random perturbation, are proposed, together with a CVT generation framework that integrates hierarchical generation, local optimization, and Monte Carlo optimization. Experimental results show that, compared with existing algorithms, the proposed method improves overall on both speed and quality.
Keywords: centroidal Voronoi tessellation; local minima; optimization; singular points
8. Quadrilateral Mesh Generation Based on Field-aligned Centroidal Voronoi Tessellation (cited: 2)
Authors: 杜兴逸, 严冬明, 叶军涛, 张慧. 《计算机辅助设计与图形学学报》 (EI, CSCD, 北大核心), 2018(5): 764-771 (8 pages)
Abstract: To generate high-quality quadrilateral meshes, a quad-mesh generation method based on field-aligned centroidal Voronoi tessellation (CVT) optimization is proposed. First, the output mesh vertices are distributed uniformly over the input mesh surface by optimizing the CVT energy function. Then a field-aligned CVT optimization yields a triangle mesh whose edges align with the input direction field. Next, a preliminary quad mesh is extracted by matching mesh edges to field directions, and singular points are identified and removed based on topological patterns. Finally, triangle pairing produces a quad-dominant mesh. Experimental results show that the method generates high-quality quad-dominant meshes aligned with the direction field.
Keywords: quadrilateral mesh; CVT; direction field; topological optimization
9. Monkey Behavior Recognition Based on a Global Spatio-temporal Encoding Network (cited: 2)
Authors: 孙峥, 张素才, 马喜波. 《图学学报》 (CSCD, 北大核心), 2022(5): 832-840 (9 pages)
Abstract: Accurate quantification of monkey behavior is a basic goal of preclinical drug-safety evaluation. An important route to analyzing monkey behavior in video is to use the target's skeleton-sequence information. However, most existing skeleton-based action recognition methods extract features separately along the temporal and spatial dimensions, ignoring the integrity of the skeleton topology across space and time. To address this, a skeleton-based action recognition method built on a global spatio-temporal encoding network (GSTEN) is proposed. On top of the spatial-temporal graph convolutional network (ST-GCN), it inserts a global token generator (GTG) and a global spatio-temporal encoder (GSTE) in parallel to extract global features along the temporal and spatial dimensions. To evaluate GSTEN, experiments were conducted on a self-built monkey behavior recognition dataset. The results show that, with almost no increase in model parameters, the network reaches an accuracy of 76.54%, a 6.79% improvement over the ST-GCN baseline.
Keywords: action recognition; skeleton sequence; global spatio-temporal encoding network; monkeys; drug-safety evaluation
10. Deep Audio-visual Learning: A Survey (cited: 5)
Authors: Hao Zhu, Man-Di Luo, Rui Wang, Ai-Hua Zheng, Ran He. International Journal of Automation and Computing (EI, CSCD), 2021(3): 351-376 (26 pages)
Abstract: Audio-visual learning, aimed at exploiting the relationship between audio and visual modalities, has drawn considerable attention since deep learning started to be used successfully. Researchers tend to leverage these two modalities to improve the performance of previously considered single-modality tasks or address new challenging problems. In this paper, we provide a comprehensive survey of recent audio-visual learning development. We divide the current audio-visual learning tasks into four different subfields: audio-visual separation and localization, audio-visual correspondence learning, audio-visual generation, and audio-visual representation learning. State-of-the-art methods, as well as the remaining challenges of each subfield, are further discussed. Finally, we summarize the commonly used datasets and challenges.
Keywords: deep audio-visual learning; audio-visual separation and localization; correspondence learning; generative models; representation learning
11. PokerNet: Expanding Features Cheaply via Depthwise Convolutions (cited: 1)
Authors: Wei Tang, Yan Huang, Liang Wang. International Journal of Automation and Computing (EI, CSCD), 2021(3): 432-442 (11 pages)
Abstract: Pointwise convolution is usually utilized to expand or squeeze features in modern lightweight deep models. However, it takes up most of the overall computational cost (usually more than 90%). This paper proposes a novel Poker module to expand features by taking advantage of cheap depthwise convolution. As a result, the Poker module can greatly reduce the computational cost, and meanwhile generate a large number of effective features to guarantee the performance. The proposed module is standardized and can be employed wherever the feature expansion is needed. By varying the stride and the number of channels, different kinds of bottlenecks are designed to plug the proposed Poker module into the network. Thus, a lightweight model can be easily assembled. Experiments conducted on benchmarks reveal the effectiveness of our proposed Poker module. Our PokerNet models can reduce the computational cost by 7.1%-15.6% and achieve comparable or even higher recognition accuracy than previous state-of-the-art (SOTA) models on the ImageNet ILSVRC2012 classification dataset. Code is available at https://github.com/diaomin/pokernet.
Keywords: deep learning; depthwise convolution; lightweight deep model; model compression; model acceleration
12. Video-Based Crowd Density Estimation and Prediction System for Wide-Area Surveillance (cited: 2)
Authors: 曹黎俊, 黄凯奇. China Communications (SCIE, CSCD), 2013(5): 79-88 (10 pages)
Abstract: Crowd density estimation in wide areas is a challenging problem for visual surveillance. Because of the high risk of degeneration, the safety of public events involving large crowds has always been a major concern. In this paper, we propose a video-based crowd density analysis and prediction system for wide-area surveillance applications. In monocular image sequences, the Accumulated Mosaic Image Difference (AMID) method is applied to extract crowd areas having irregular motion. The specific number of persons and velocity of a crowd can be adequately estimated by our system from the density of crowded areas. Using a multi-camera network, we can obtain predictions of a crowd's density several minutes in advance. The system has been used in real applications, and numerous experiments conducted in real scenes (station, park, plaza) demonstrate the effectiveness and robustness of the proposed method.
Keywords: crowd density estimation; prediction system; AMID; visual surveillance
13. Depth-Guided Vision Transformer With Normalizing Flows for Monocular 3D Object Detection
Authors: Cong Pan, Junran Peng, Zhaoxiang Zhang. IEEE/CAA Journal of Automatica Sinica (SCIE, EI, CSCD), 2024(3): 673-689 (17 pages)
Abstract: Monocular 3D object detection is challenging due to the lack of accurate depth information. Some methods estimate pixel-wise depth maps from off-the-shelf depth estimators and then use them as an additional input to augment the RGB images. Depth-based methods attempt to convert estimated depth maps to pseudo-LiDAR and then use LiDAR-based object detectors, or focus on the perspective of image and depth fusion learning. However, they demonstrate limited performance and efficiency as a result of depth inaccuracy and complex fusion modes with convolutions. Different from these approaches, our proposed depth-guided vision transformer with normalizing flows (NF-DVT) network uses normalizing flows to build priors in depth maps to achieve more accurate depth information. We then develop a novel Swin-Transformer-based backbone with a fusion module that processes RGB image patches and depth map patches in two separate branches and fuses them using cross-attention to exchange information with each other. Furthermore, with the help of pixel-wise relative depth values in depth maps, we develop new relative position embeddings in the cross-attention mechanism to capture more accurate sequence ordering of input tokens. Our method is the first Swin-Transformer-based backbone architecture for monocular 3D object detection. The experimental results on the KITTI and the challenging Waymo Open datasets show the effectiveness of our proposed method and its superior performance over previous counterparts.
Keywords: monocular 3D object detection; normalizing flows; Swin Transformer
14. Automatic and Robust Extraction of Conics from Images
Authors: 吴劭桓, 郭世毅. 《电子工业专用设备》, 2019(1): 66-71 (6 pages)
Abstract: Automatic extraction of conics from images has broad application value in camera calibration, robot localization, augmented reality, and many other fields. Existing methods typically extract a single conic, and even for a single conic, extraction usually fails when the conic is incomplete. This paper proposes an algorithm for automatically fitting multiple conics in an image. Its contributions are: (1) a conic detection method combining Gaussian elimination with the Hough transform; (2) conic detection combining RANSAC with the Hough transform; and (3) a new geometric distance for conic fitting. The method handles incomplete or redundant data and applies to fitting both single and multiple conics. Compared with existing methods, the experimental results demonstrate its efficiency.
Keywords: conic fitting; conic detection; Hough transform
15. TENET: Beyond Pseudo-labeling for Semi-supervised Few-shot Learning
Authors: Chengcheng Ma, Weiming Dong, Changsheng Xu. Machine Intelligence Research, 2025(3): 511-523 (13 pages)
Abstract: Few-shot learning attempts to identify novel categories by exploiting limited labeled training data, while the performances of existing methods still have much room for improvement. Thanks to a very low cost, many recent methods resort to additional unlabeled training data to boost performance, known as semi-supervised few-shot learning (SSFSL). The general idea of SSFSL methods is to first generate pseudo labels for all unlabeled data and then augment the labeled training set with selected pseudo-labeled data. However, almost all previous SSFSL methods only take the supervision signal from pseudo-labeling, ignoring that the distribution of training data can also be utilized as an effective unsupervised regularization. In this paper, we propose a simple yet effective SSFSL method named the feature-reconstruction-based regression method (TENET), which takes low-rank feature reconstruction as the unsupervised objective function and pseudo labels as the supervised constraint. We provide several theoretical insights on why TENET can mitigate overfitting on low-quality training data, and why it can enhance robustness against inaccurate pseudo labels. Extensive experiments on four popular datasets validate the effectiveness of TENET.
Keywords: semi-supervised few-shot learning; few-shot learning; pseudo-labeling; linear regression; low-rank reconstruction
16. CMSL: Cross-modal Style Learning for Few-shot Image Generation
Authors: Yue Jiang, Yueming Lyu, Bo Peng, Wei Wang, Jing Dong. Machine Intelligence Research, 2025(4): 752-768 (17 pages)
Abstract: Training generative adversarial networks is data-demanding, which limits the development of these models on target domains with inadequate training data. Recently, researchers have leveraged generative models pretrained on sufficient data and fine-tuned them using small training samples, thus reducing data requirements. However, due to the lack of explicit focus on target styles and a disproportionate concentration on generative consistency, these methods do not perform well in diversity preservation, which represents the adaptation ability of few-shot generative models. To mitigate the diversity degradation, we propose a framework with two key strategies: 1) To obtain more diverse styles from limited training data effectively, we propose a cross-modal module that explicitly obtains the target styles with a style prototype space and text-guided style instructions. 2) To inherit the generation capability of the pretrained model, we constrain the similarity between the generated and source images with a structural discrepancy alignment module by maintaining the structure correlation in multiscale areas. Through extensive experiments and analyses, we demonstrate the effectiveness of our method, which outperforms state-of-the-art methods in mitigating diversity degradation.
Keywords: few-shot image generation; cross-modal learning; prototype learning; contrastive learning; computer vision
17. AdaGPAR: Generalizable Pedestrian Attribute Recognition via Test-time Adaptation
Authors: Da Li, Zhang Zhang, Yifan Zhang, Zhen Jia, Caifeng Shan. Machine Intelligence Research, 2025(4): 783-796 (14 pages)
Abstract: Generalizable pedestrian attribute recognition (PAR) aims to learn a robust PAR model that can be directly adapted to unknown distributions under varying illumination, different viewpoints and occlusions, which is an essential problem for real-world applications such as video surveillance and fashion search. In practice, when a trained PAR model is deployed to real-world scenarios, the unseen target samples are fed into the model continuously in an online manner. Therefore, this paper proposes an efficient and flexible method, named AdaGPAR, for generalizable PAR (GPAR) via test-time adaptation (TTA), where we adapt the trained model by exploiting the unlabeled target samples online during the test phase. As far as we know, it is the first work that solves GPAR from the perspective of TTA. In particular, the proposed AdaGPAR gradually memorizes the reliable target sample pairs (features and pseudo-labels) as prototypes during the test phase. It then makes predictions with a non-parametric classifier by calculating the similarity between a target instance and the prototypes. However, since PAR is a multi-label classification task, using only the same holistic feature of one pedestrian image as the prototypes of multiple attributes is not optimal. Therefore, an attribute localization branch is introduced to extract attribute-specific features, where two kinds of memory banks are further constructed to cache the global and attribute-specific features simultaneously. In summary, AdaGPAR is training-free in the test phase and predicts multiple pedestrian attributes of the target samples in an online manner. This makes AdaGPAR time-efficient and generalizable for real-world applications. Extensive experiments have been performed on the UPAR benchmark to compare the proposed method with multiple baselines. The superior performance demonstrates the effectiveness of the proposed AdaGPAR, which improves the generalizability of a PAR model via TTA.
Keywords: pedestrian attribute recognition; domain generalization; test-time adaptation; attribute localization; non-parametric classifier
18. Balanced Representation Learning for Long-tailed Skeleton-based Action Recognition
Authors: Hongda Liu, Yunlong Wang, Min Ren, Junxing Hu, Zhengquan Luo, Guangqi Hou, Zhenan Sun. Machine Intelligence Research, 2025(3): 466-483 (18 pages)
Abstract: Skeleton-based action recognition has recently made significant progress. However, data imbalance is still a great challenge in real-world scenarios. The performance of current action recognition algorithms declines sharply when training data suffers from heavy class imbalance. The imbalanced data actually degrades the representations learned by these methods and becomes the bottleneck for action recognition. How to learn unbiased representations from imbalanced action data is the key to long-tailed action recognition. In this paper, we propose a novel balanced representation learning method to address the long-tailed problem in action recognition. Firstly, a spatial-temporal action exploration strategy is presented to expand the sample space effectively, generating more valuable samples in a rebalanced manner. Secondly, we design a detached action-aware learning schedule to further mitigate the bias in the representation space. The schedule detaches the representation learning of tail classes from training and proposes an action-aware loss to impose more effective constraints. Additionally, a skip-type representation is proposed to provide complementary structural information. The proposed method is validated on four skeleton datasets: NTU RGB+D 60, NTU RGB+D 120, NW-UCLA and Kinetics. It not only achieves consistently large improvement compared to the state-of-the-art (SOTA) methods, but also demonstrates a superior generalization capacity through extensive experiments. Our code is available at https://github.com/firework8/BRL.
Keywords: action recognition; skeleton sequence; long-tailed visual recognition; imbalance learning
19. Multimodal Pretrained Knowledge for Real-world Object Navigation
Authors: Hui Yuan, Yan Huang, Naigong Yu, Dongbo Zhang, Zetao Du, Ziqi Liu, Kun Zhang. Machine Intelligence Research, 2025(4): 713-729 (17 pages)
Abstract: Most visual-language navigation (VLN) research focuses on simulated environments, but applying these methods to real-world scenarios is challenging because of misalignments between vision and language in complex environments, leading to path deviations. To address this, we propose a novel vision-and-language object navigation strategy that uses multimodal pretrained knowledge as a cross-modal bridge to link semantic concepts in both images and text. This improves navigation supervision at key-points and enhances robustness. Specifically, we 1) randomly generate key-points within a specific density range and optimize them on the basis of challenging locations; 2) use pretrained multimodal knowledge to efficiently retrieve target objects; 3) combine depth information with simultaneous localization and mapping (SLAM) map data to predict optimal positions and orientations for accurate navigation; and 4) implement the method on a physical robot, successfully conducting navigation tests. Our approach achieves a maximum success rate of 66.7%, outperforming existing VLN methods in real-world environments.
Keywords: visual-and-language object navigation; key-points; multimodal pretrained knowledge; optimal positions and orientations; physical robot
20. Transformers in computational visual media: A survey (cited: 17)
Authors: Yifan Xu, Huapeng Wei, Minxuan Lin, Yingying Deng, Kekai Sheng, Mengdan Zhang, Fan Tang, Weiming Dong, Feiyue Huang, Changsheng Xu. Computational Visual Media (SCIE, EI, CSCD), 2022(1): 33-62 (30 pages)
Abstract: Transformers, the dominant architecture for natural language processing, have also recently attracted much attention from computational visual media researchers due to their capacity for long-range representation and high performance. Transformers are sequence-to-sequence models, which use a self-attention mechanism rather than the RNN sequential structure. Thus, such models can be trained in parallel and can represent global information. This study comprehensively surveys recent visual transformer works. We categorize them according to task scenario: backbone design, high-level vision, low-level vision and generation, and multimodal learning. Their key ideas are also analyzed. Differing from previous surveys, we mainly focus on visual transformer methods in low-level vision and generation. The latest works on backbone design are also reviewed in detail. For ease of understanding, we precisely describe the main contributions of the latest works in the form of tables. As well as giving quantitative comparisons, we also present image results for low-level vision and generation tasks. Computational costs and source code links for various important works are also given in this survey to assist further development.
Keywords: visual transformer; computational visual media (CVM); high-level vision; low-level vision; image generation; multi-modal learning