Funding: Supported in part by the National Natural Science Foundation of China under Grant Nos. 61931020, U19B2024, 62171449, and 62001483, and in part by the Science and Technology Innovation Program of Hunan Province under Grant No. 2021JJ40690.
Abstract: Increasing research has focused on semantic communication, the goal of which is to accurately convey the meaning of a message rather than to transmit symbols from the sender to the receiver. In this paper, we design a novel encoding and decoding semantic communication framework that exploits semantic information and the contextual correlations between items to optimize the performance of a communication system over various channels. On the sender side, the average semantic loss caused by erroneous detection is defined, and a semantic source encoding strategy is developed to minimize this average semantic loss. To further improve communication reliability, a decoding strategy that uses semantic and context information to recover messages is proposed at the receiver. Extensive simulation results validate the superior performance of our strategies over state-of-the-art semantic coding and decoding policies on different communication channels.
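The abstract does not spell out how the average semantic loss is computed, but the definition it sketches (an expected semantic distance under detection errors) admits a compact illustration. In the minimal sketch below, the message priors `p_msg`, the pairwise semantic-distance matrix `sem_dist`, and the channel confusion matrix are all hypothetical placeholders, and the brute-force codeword search merely stands in for the paper's encoding strategy, which is not given here.

```python
from itertools import permutations

import numpy as np

# Hypothetical setup: 4 messages with priors, a pairwise semantic-distance
# matrix (0 = identical meaning), and a channel confusion matrix whose entry
# [i, j] is the probability that codeword i is detected as codeword j.
p_msg = np.array([0.4, 0.3, 0.2, 0.1])
sem_dist = np.array([[0.0, 0.2, 0.9, 1.0],
                     [0.2, 0.0, 0.8, 0.9],
                     [0.9, 0.8, 0.0, 0.3],
                     [1.0, 0.9, 0.3, 0.0]])

def average_semantic_loss(p_confuse: np.ndarray) -> float:
    """Expected semantic loss: sum_i p(m_i) * sum_j P(m_j | m_i) * d(m_i, m_j)."""
    return float(np.sum(p_msg[:, None] * p_confuse * sem_dist))

def best_assignment(base_confuse: np.ndarray):
    """Assigning message i to codeword perm[i] permutes the confusion matrix;
    exhaustive search finds the assignment with the lowest expected loss."""
    best, best_loss = None, np.inf
    for perm in permutations(range(len(p_msg))):
        perm = list(perm)
        loss = average_semantic_loss(base_confuse[np.ix_(perm, perm)])
        if loss < best_loss:
            best, best_loss = perm, loss
    return best, best_loss
```

For realistic message sets, the exhaustive permutation search would be replaced by the paper's optimization strategy; the sketch only makes the objective concrete.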
Funding: Supported by the Science and Technology Innovation 2030 Major Project of "New Generation Artificial Intelligence" (No. 2022ZD0115901) and the National Natural Science Foundation of China (No. 62332003).
Abstract: Intrinsic decomposition, the process of decomposing an image into reflectance and shading, is widely used in virtual and augmented reality tasks. Reflectance and shading often exhibit large gradients at object edges, and the intrinsic properties of the same object tend to be similar. This spatial coherence is closely related to semantic consistency, because objects within the same semantic category often exhibit similar intrinsic properties. Therefore, incorporating semantic segmentation into a deep intrinsic decomposition framework helps the network distinguish between different object instances and understand high-level scene structures. To this end, we design an intrinsic decomposition network jointly trained with a dedicated semantic segmentation module, allowing semantic cues to enhance the decomposition of reflectance and shading. The semantic module provides guidance during training but is removed during inference, improving performance without increasing the inference cost. Additionally, to capture the global contextual dependencies critical for intrinsic decomposition, we adopt a Transformer-based backbone. This backbone enables the model to associate distant regions with similar material properties, thereby maintaining consistency in reflectance and learning smooth illumination patterns across a scene. A convolutional decoder is also designed to output predictions with improved details. Experiments demonstrate that our approach achieves state-of-the-art performance in quantitative evaluations on the Intrinsic Images in the Wild (IIW) and Shading Annotations in the Wild (SAW) datasets.
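The train-time-only semantic module follows a common auxiliary-head pattern, which can be illustrated compactly. The PyTorch sketch below is not the paper's architecture: `IntrinsicNet`, the tiny backbone, and the 0.1 loss weight are illustrative stand-ins showing how a segmentation head can supervise shared features during training and be skipped at inference.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IntrinsicNet(nn.Module):
    """Toy stand-in: a shared backbone with reflectance/shading decoders and
    an auxiliary semantic-segmentation head used only during training."""
    def __init__(self, num_classes: int = 21):
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
        self.reflectance_head = nn.Conv2d(32, 3, 1)
        self.shading_head = nn.Conv2d(32, 1, 1)
        self.seg_head = nn.Conv2d(32, num_classes, 1)  # dropped at inference

    def forward(self, x, with_seg: bool = False):
        feat = self.backbone(x)
        r, s = self.reflectance_head(feat), self.shading_head(feat)
        if with_seg:
            return r, s, self.seg_head(feat)
        return r, s  # inference path: the seg head adds no cost here

# Training step: semantic labels supervise the shared features.
model = IntrinsicNet()
img = torch.randn(2, 3, 64, 64)
gt_r, gt_s = torch.rand(2, 3, 64, 64), torch.rand(2, 1, 64, 64)
gt_seg = torch.randint(0, 21, (2, 64, 64))
r, s, seg = model(img, with_seg=True)
loss = F.mse_loss(r, gt_r) + F.mse_loss(s, gt_s) \
     + 0.1 * F.cross_entropy(seg, gt_seg)  # auxiliary semantic term
loss.backward()
```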
Funding: Supported by the National Natural Science Foundation of China (No. 52374155) and the Anhui Provincial Natural Science Foundation (No. 2308085MF218).
Abstract: The convolutional neural network (CNN) method based on DeepLabv3+ has several problems in the semantic segmentation of high-resolution remote sensing images, such as a fixed receptive field size during feature extraction, a lack of semantic information, a large decoder upsampling ratio, and insufficient ability to retain detail. To address these issues, a hierarchical feature fusion network (HFFNet) is proposed. First, a combination of Transformer and CNN architectures is employed to extract features from images at varying resolutions, and the extracted features are processed independently. Subsequently, the Transformer and CNN features are fused under the guidance of features from different sources; this fusion helps restore information more completely during the decoding stage. Furthermore, a spatial-channel attention module is designed for the final stage of decoding to refine features and reduce the semantic gap between shallow CNN features and deep decoder features. Experimental results show that HFFNet performs well on the UAVid, LoveDA, Potsdam, and Vaihingen datasets, with an intersection-over-union (IoU) score better than that of DeepLabv3+ and other competing methods, demonstrating strong generalization ability.
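The spatial-channel attention module is only named in the abstract; one common realization combines squeeze-and-excitation-style channel attention with a CBAM-style spatial gate. The sketch below is such an assumed realization, not HFFNet's actual module; `SpatialChannelAttention` and the fusion convolution are illustrative.

```python
import torch
import torch.nn as nn

class SpatialChannelAttention(nn.Module):
    """Illustrative refinement block: channel attention (squeeze-excite style)
    followed by spatial attention, applied to fused CNN + Transformer features."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid())

    def forward(self, x):
        x = x * self.channel_mlp(x)                       # reweight channels
        pooled = torch.cat([x.mean(1, keepdim=True),
                            x.max(1, keepdim=True).values], dim=1)
        return x * self.spatial_conv(pooled)              # reweight locations

# Fusing same-resolution CNN and Transformer feature maps before decoding.
fuse = nn.Conv2d(256 + 256, 256, 1)
attn = SpatialChannelAttention(256)
cnn_feat, trans_feat = torch.randn(1, 256, 32, 32), torch.randn(1, 256, 32, 32)
refined = attn(fuse(torch.cat([cnn_feat, trans_feat], dim=1)))
```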
Funding: Supported by the National Key R&D Program of China (No. 2018YFB1305200) and the National Natural Science Foundation of China (Nos. 61906134, 62020106004, 92048301, and 61925201).
Abstract: Existing multi-view three-dimensional (3D) reconstruction methods can capture only a single type of feature from an input view, failing to obtain the fine-grained semantics needed to reconstruct complex shapes. They also rarely explore the semantic association between input views, leading to rough 3D shapes. To address these challenges, we propose a semantics-aware transformer (SATF) for 3D reconstruction. It is composed of two parallel view transformer encoders and a point cloud transformer decoder; it takes two red-green-blue (RGB) images as input and outputs a dense point cloud with rich details. Each view transformer encoder learns multi-level features, facilitating the characterization of fine-grained semantics in the input view. The point cloud transformer decoder derives a semantically associated feature by aligning the semantics of the two input views, which captures the semantic association between them, and uses this feature to generate a sparse point cloud. Finally, the decoder enriches the sparse point cloud to produce a dense point cloud with richer details. Extensive experiments on the ShapeNet dataset show that our SATF outperforms state-of-the-art methods.
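The encoder-decoder layout described above can be summarized as a structural skeleton. The sketch below only mirrors the stated topology (two parallel view encoders, cross-view alignment, then sparse-to-dense point generation); every module choice, dimension, and the densification rule are toy assumptions, not SATF's actual design.

```python
import torch
import torch.nn as nn

class SATFSkeleton(nn.Module):
    """Structural sketch: two parallel view encoders, cross-view attention for
    semantic alignment, a coarse point generator, and a densification MLP."""
    def __init__(self, d: int = 128, n_sparse: int = 256, up: int = 4):
        super().__init__()
        self.n_sparse, self.up = n_sparse, up
        self.patch_embed = nn.Conv2d(3, d, kernel_size=16, stride=16)
        layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.view_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.cross_view = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.to_sparse = nn.Linear(d, n_sparse * 3)
        self.densify = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                                     nn.Linear(64, up * 3))

    def encode(self, img):
        tokens = self.patch_embed(img).flatten(2).transpose(1, 2)  # (B, N, d)
        return self.view_encoder(tokens)

    def forward(self, view_a, view_b):
        ta, tb = self.encode(view_a), self.encode(view_b)
        aligned, _ = self.cross_view(ta, tb, tb)      # align the two views' semantics
        global_feat = aligned.mean(dim=1)             # (B, d)
        sparse = self.to_sparse(global_feat).view(-1, self.n_sparse, 3)
        offsets = self.densify(sparse).view(sparse.size(0), -1, 3)
        dense = sparse.repeat_interleave(self.up, dim=1) + 0.05 * offsets
        return sparse, dense

model = SATFSkeleton()
sparse, dense = model(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64))
# sparse: (1, 256, 3); dense: (1, 1024, 3)
```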
Funding: Supported in part by the National Natural Science Foundation of China under Grant 61873277, in part by the Natural Science Basic Research Plan in Shaanxi Province of China under Grant 2020JQ-758, and in part by the Chinese Postdoctoral Science Foundation under Grant 2020M673446.
Abstract: In video captioning methods based on an encoder-decoder architecture, limited visual features are extracted by an encoder, and a natural-language sentence describing the video content is generated by a decoder. However, such methods depend on a single video input source and few visual labels, and they suffer from poor semantic alignment between video contents and the generated sentences, making them unsuitable for accurately comprehending and describing video contents. To address this issue, this paper proposes a video captioning method based on semantic topic-guided generation. First, a 3D convolutional neural network is used to extract the spatiotemporal features of videos during encoding. Then, the semantic topics of the video data are extracted using visual labels retrieved from similar video data. During decoding, a decoder is constructed by combining a novel Enhance-TopK sampling algorithm with a Generative Pre-trained Transformer-2 (GPT-2) deep neural network, which reduces the "deviation" in the semantic mapping between videos and texts by jointly decoding a baseline and the semantic topics of the video contents. In this process, the Enhance-TopK sampling algorithm alleviates the long-tail problem by dynamically adjusting the probability distribution of the predicted words. Finally, experiments are conducted on two public datasets, Microsoft Research Video Description (MSVD) and Microsoft Research-Video to Text (MSR-VTT). The results demonstrate that the proposed method outperforms several state-of-the-art approaches: the Bilingual Evaluation Understudy (BLEU), Metric for Evaluation of Translation with Explicit Ordering (METEOR), Recall-Oriented Understudy for Gisting Evaluation-longest common subsequence (ROUGE-L), and Consensus-based Image Description Evaluation (CIDEr) scores improve by 1.2%, 0.1%, 0.3%, and 2.4% on MSVD, and by 0.1%, 1.0%, 0.1%, and 2.8% on MSR-VTT, respectively, compared with existing video captioning methods. As a result, the proposed method generates captions that align more closely with natural human language expression.
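The abstract describes Enhance-TopK only as dynamically reshaping the predicted word distribution to counter the long-tail problem. The sketch below is one plausible reading, not the paper's exact rule: it samples from the top-k words and raises the sampling temperature when the head of the distribution is overly peaked, so lower-ranked words keep non-negligible probability.

```python
import torch

def enhanced_topk_sample(logits: torch.Tensor, k: int = 40,
                         max_temp: float = 2.0) -> int:
    """Hedged interpretation of an Enhance-TopK-style sampler: the flattening
    strength adapts to how peaked the top-k distribution is at each step."""
    top_vals, top_idx = torch.topk(logits, k)
    probs = torch.softmax(top_vals, dim=-1)
    # Normalized entropy in [0, 1]: low entropy -> peaked head -> more smoothing.
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum()
    peakedness = 1.0 - entropy / torch.log(torch.tensor(float(k)))
    temp = 1.0 + (max_temp - 1.0) * peakedness
    smoothed = torch.softmax(top_vals / temp, dim=-1)
    choice = torch.multinomial(smoothed, num_samples=1)
    return int(top_idx[choice])

# Example: pick the next token id from a toy 1000-word vocabulary.
next_id = enhanced_topk_sample(torch.randn(1000))
```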