视频字幕生成(Video Captioning)旨在用自然语言描述视频中的内容,在人机交互、辅助视障人士、体育视频解说等领域具有广泛的应用前景。然而视频中复杂的时空内容变化增加了视频字幕生成的难度,之前的方法通过提取时空特征、先验信息等...视频字幕生成(Video Captioning)旨在用自然语言描述视频中的内容,在人机交互、辅助视障人士、体育视频解说等领域具有广泛的应用前景。然而视频中复杂的时空内容变化增加了视频字幕生成的难度,之前的方法通过提取时空特征、先验信息等方式提高生成字幕的质量,但在时空联合建模方面仍存在不足,可能导致视觉信息提取不充分,影响字幕生成结果。为了解决这个问题,本文提出一种新颖的时空增强的状态空间模型和Transformer(SpatioTemporal-enhanced State space model and Transformer,ST2)模型,通过引入最近流行的具有全局感受野和线性的计算复杂度的Mamba(一种状态空间模型),增强时空联合建模能力。首先,通过将Mamba与Transformer并行结合,提出空间增强的状态空间模型(State Space Model,SSM)和Transformer(Spatial enHanced State space model and Transformer module,SH-ST),克服了卷积的感受野问题并降低计算复杂度,同时增强模型提取空间信息的能力。然后为了增强时间建模,我们利用Mamba的时间扫描特性,并结合Transformer的全局建模能力,提出时间增强的SSM和Transformer(Temporal enHanced State space model and Transformer module,TH-ST)。具体地,我们对SH-ST产生的特征进行重排序,从而使Mamba以交叉扫描的方式增强重排序后特征的时间关系,最后用Transformer进一步增强时间建模能力。实验结果表明,我们ST2模型中SH-ST和TH-ST结构设计的有效性,且在广泛使用的视频字幕生成数据集MSVD和MSR-VTT上取得了具有竞争力的结果。具体的,我们的方法分别在MSVD和MSR-VTT数据集上的绝对CIDEr分数超过最先进的结果6.9%和2.6%,在MSVD上的绝对CIDEr分数超过了基线结果4.9%。展开更多
Two video coding schemes based on wavelet transform achieving very low bit rate are presented in this paper. The first is a hybrid motion compensated wavelet transform(MC WT)system which behaves better at very low ...Two video coding schemes based on wavelet transform achieving very low bit rate are presented in this paper. The first is a hybrid motion compensated wavelet transform(MC WT)system which behaves better at very low bit rates than the block DCT residual coder. The second is a new efficient coding system based on a simple frame differencing wavelet transform(FD WT)which performs well in both PSNR and visual quality with substantially reduced complexity.展开更多
This paper presents an algorithm for coding video signal based on 3-D wavelet transformation. When the frame order t of a video signal is replaced by order 2, the video signal can be looked as a block in 3-D space. Af...This paper presents an algorithm for coding video signal based on 3-D wavelet transformation. When the frame order t of a video signal is replaced by order 2, the video signal can be looked as a block in 3-D space. After splitting the block into smaller sub-blocks, imitate the method of 2-D wavelet transformation for images, we can transform the sub-blocks with 3-D wavelet. Most of video signal energy is in the decomposed low-frequency sub-bands. These sub-bands affect the visual quality of the video signal most. Quantizing different sub-bands with different precision and then entropy encoding each sub-band, we can eliminate inter- and intra-frame redundancy of the video signal and compress data. Our simulation experiments show that this algorithm can achieve very good result.展开更多
This paper presents an optimized 3-D Discrete Wavelet Transform (3-DDWT) architecture. 1-DDWT employed for the design of 3-DDWT architecture uses reduced lifting scheme approach. Further the architecture is optimized ...This paper presents an optimized 3-D Discrete Wavelet Transform (3-DDWT) architecture. 1-DDWT employed for the design of 3-DDWT architecture uses reduced lifting scheme approach. Further the architecture is optimized by applying block enabling technique, scaling, and rounding of the filter coefficients. The proposed architecture uses biorthogonal (9/7) wavelet filter. The architecture is modeled using Verilog HDL, simulated using ModelSim, synthesized using Xilinx ISE and finally implemented on Virtex-5 FPGA. The proposed 3-DDWT architecture has slice register utilization of 5%, operating frequency of 396 MHz and a power consumption of 0.45 W.展开更多
A new improved Goh's 3 D wavelet transform(WT) coding scheme is presented in this paper. The new scheme has great advantages including a simple code structure, low computation cost and good performance in PSNR, c...A new improved Goh's 3 D wavelet transform(WT) coding scheme is presented in this paper. The new scheme has great advantages including a simple code structure, low computation cost and good performance in PSNR, compression ratios and visual quality of reconstructions, when compared to the other existing 3 D WT coding methods and the 2 D WT based coding methods. The new 3 D WT coding scheme is suitable for very low bit rate video coding.展开更多
A new motion compensated 3 D wavelet transform (MC 3DWT) video coding scheme is presented in this paper. The new coding scheme has a good performance in average PSNR, compression ratio and visual quality of reconst...A new motion compensated 3 D wavelet transform (MC 3DWT) video coding scheme is presented in this paper. The new coding scheme has a good performance in average PSNR, compression ratio and visual quality of reconstructions compared with the existing 3 D wavelet transform (3DWT) coding methods and motion compensated 2 D wavelet transform (MC WT) coding method. The new MC 3DWT coding scheme is suitable for very low bit rate video coding.展开更多
为了解决视频行人再识别领域仅使用卷积神经网络进行行人特征提取效果不佳的问题,提出一种基于卷积神经网络和Transformer的ResTNet(ResNet and Transformer network)网络模型。ResTNet利用ResNet50网络得到局部特征,令中间层输出作为Tr...为了解决视频行人再识别领域仅使用卷积神经网络进行行人特征提取效果不佳的问题,提出一种基于卷积神经网络和Transformer的ResTNet(ResNet and Transformer network)网络模型。ResTNet利用ResNet50网络得到局部特征,令中间层输出作为Transformer的先验知识输入。在Transformer分支中不断缩小特征图尺寸,扩大感受野,充分挖掘局部特征之间的关系,生成行人的全局特征,同时利用移位窗口方法减少模型计算量。在大规模MARS数据集上,Rank-1和mAP分别达到86.8%和80.3%,比基准分别增加了3.8%和3.3%,在2个小规模数据集上也取得了良好效果。在几大数据集上的大量实验表明,本文方法能增强行人识别的鲁棒性,有效提高行人再识别的准确率。展开更多
文摘视频字幕生成(Video Captioning)旨在用自然语言描述视频中的内容,在人机交互、辅助视障人士、体育视频解说等领域具有广泛的应用前景。然而视频中复杂的时空内容变化增加了视频字幕生成的难度,之前的方法通过提取时空特征、先验信息等方式提高生成字幕的质量,但在时空联合建模方面仍存在不足,可能导致视觉信息提取不充分,影响字幕生成结果。为了解决这个问题,本文提出一种新颖的时空增强的状态空间模型和Transformer(SpatioTemporal-enhanced State space model and Transformer,ST2)模型,通过引入最近流行的具有全局感受野和线性的计算复杂度的Mamba(一种状态空间模型),增强时空联合建模能力。首先,通过将Mamba与Transformer并行结合,提出空间增强的状态空间模型(State Space Model,SSM)和Transformer(Spatial enHanced State space model and Transformer module,SH-ST),克服了卷积的感受野问题并降低计算复杂度,同时增强模型提取空间信息的能力。然后为了增强时间建模,我们利用Mamba的时间扫描特性,并结合Transformer的全局建模能力,提出时间增强的SSM和Transformer(Temporal enHanced State space model and Transformer module,TH-ST)。具体地,我们对SH-ST产生的特征进行重排序,从而使Mamba以交叉扫描的方式增强重排序后特征的时间关系,最后用Transformer进一步增强时间建模能力。实验结果表明,我们ST2模型中SH-ST和TH-ST结构设计的有效性,且在广泛使用的视频字幕生成数据集MSVD和MSR-VTT上取得了具有竞争力的结果。具体的,我们的方法分别在MSVD和MSR-VTT数据集上的绝对CIDEr分数超过最先进的结果6.9%和2.6%,在MSVD上的绝对CIDEr分数超过了基线结果4.9%。
文摘Two video coding schemes based on wavelet transform achieving very low bit rate are presented in this paper. The first is a hybrid motion compensated wavelet transform(MC WT)system which behaves better at very low bit rates than the block DCT residual coder. The second is a new efficient coding system based on a simple frame differencing wavelet transform(FD WT)which performs well in both PSNR and visual quality with substantially reduced complexity.
文摘This paper presents an algorithm for coding video signal based on 3-D wavelet transformation. When the frame order t of a video signal is replaced by order 2, the video signal can be looked as a block in 3-D space. After splitting the block into smaller sub-blocks, imitate the method of 2-D wavelet transformation for images, we can transform the sub-blocks with 3-D wavelet. Most of video signal energy is in the decomposed low-frequency sub-bands. These sub-bands affect the visual quality of the video signal most. Quantizing different sub-bands with different precision and then entropy encoding each sub-band, we can eliminate inter- and intra-frame redundancy of the video signal and compress data. Our simulation experiments show that this algorithm can achieve very good result.
文摘This paper presents an optimized 3-D Discrete Wavelet Transform (3-DDWT) architecture. 1-DDWT employed for the design of 3-DDWT architecture uses reduced lifting scheme approach. Further the architecture is optimized by applying block enabling technique, scaling, and rounding of the filter coefficients. The proposed architecture uses biorthogonal (9/7) wavelet filter. The architecture is modeled using Verilog HDL, simulated using ModelSim, synthesized using Xilinx ISE and finally implemented on Virtex-5 FPGA. The proposed 3-DDWT architecture has slice register utilization of 5%, operating frequency of 396 MHz and a power consumption of 0.45 W.
文摘A new improved Goh's 3 D wavelet transform(WT) coding scheme is presented in this paper. The new scheme has great advantages including a simple code structure, low computation cost and good performance in PSNR, compression ratios and visual quality of reconstructions, when compared to the other existing 3 D WT coding methods and the 2 D WT based coding methods. The new 3 D WT coding scheme is suitable for very low bit rate video coding.
文摘A new motion compensated 3 D wavelet transform (MC 3DWT) video coding scheme is presented in this paper. The new coding scheme has a good performance in average PSNR, compression ratio and visual quality of reconstructions compared with the existing 3 D wavelet transform (3DWT) coding methods and motion compensated 2 D wavelet transform (MC WT) coding method. The new MC 3DWT coding scheme is suitable for very low bit rate video coding.
文摘为了解决视频行人再识别领域仅使用卷积神经网络进行行人特征提取效果不佳的问题,提出一种基于卷积神经网络和Transformer的ResTNet(ResNet and Transformer network)网络模型。ResTNet利用ResNet50网络得到局部特征,令中间层输出作为Transformer的先验知识输入。在Transformer分支中不断缩小特征图尺寸,扩大感受野,充分挖掘局部特征之间的关系,生成行人的全局特征,同时利用移位窗口方法减少模型计算量。在大规模MARS数据集上,Rank-1和mAP分别达到86.8%和80.3%,比基准分别增加了3.8%和3.3%,在2个小规模数据集上也取得了良好效果。在几大数据集上的大量实验表明,本文方法能增强行人识别的鲁棒性,有效提高行人再识别的准确率。