Funding: Supported by the National Key Research and Development Program of China (No. 2016YFB1001001), the Beijing Natural Science Foundation (No. JQ18017), and the National Natural Science Foundation of China (No. 61976002).
Abstract: Audio-visual learning, which aims to exploit the relationship between the audio and visual modalities, has drawn considerable attention since deep learning began to be applied successfully. Researchers tend to leverage these two modalities either to improve the performance of previously considered single-modality tasks or to address new and challenging problems. In this paper, we provide a comprehensive survey of recent developments in audio-visual learning. We divide current audio-visual learning tasks into four subfields: audio-visual separation and localization, audio-visual correspondence learning, audio-visual generation, and audio-visual representation learning. State-of-the-art methods, as well as the remaining challenges of each subfield, are further discussed. Finally, we summarize the commonly used datasets and challenges.
Funding: Supported by the National Natural Science Foundation of China (Nos. 61525306, 61633021, 61721004, 61806194, U1803261 and 61976132), the Major Project for New Generation of AI (No. 2018AAA0100400), the Beijing Nova Program (No. Z201100006820079), the Shandong Provincial Key Research and Development Program (No. 2019JZZY010119), and CAS-AIR.
Abstract: Pointwise convolution is usually utilized to expand or squeeze features in modern lightweight deep models; however, it accounts for most of the overall computational cost (usually more than 90%). This paper proposes a novel Poker module that expands features by taking advantage of cheap depthwise convolution. As a result, the Poker module greatly reduces the computational cost while generating a large number of effective features to guarantee performance. The proposed module is standardized and can be employed wherever feature expansion is needed. By varying the stride and the number of channels, different kinds of bottlenecks are designed to plug the Poker module into a network, so a lightweight model can be easily assembled. Experiments conducted on benchmarks reveal the effectiveness of the proposed Poker module. Our PokerNet models reduce the computational cost by 7.1%-15.6% and achieve comparable or even higher recognition accuracy than previous state-of-the-art (SOTA) models on the ImageNet ILSVRC2012 classification dataset. Code is available at https://github.com/diaomin/pokernet.
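The core trick, replacing the expensive 1x1 pointwise expansion with a cheap depthwise convolution whose channel multiplier produces the extra feature maps, is easy to illustrate. Below is a minimal PyTorch sketch reconstructed from the abstract alone (the class name, kernel size, and default expansion factor are assumptions; the authors' actual module is in the repository linked above):

```python
import torch
import torch.nn as nn

class PokerExpand(nn.Module):
    """Hypothetical reconstruction of depthwise feature expansion: a
    depthwise 3x3 convolution with channel multiplier `t` replaces the
    usual 1x1 pointwise expansion (names and defaults are assumptions;
    see the official repository for the real module)."""
    def __init__(self, in_ch, t=4, stride=1):
        super().__init__()
        # groups=in_ch makes this depthwise: each input channel is filtered
        # independently and yields `t` output maps, so expansion costs about
        # 9*C*t MACs per pixel instead of the pointwise C*C*t.
        self.dw = nn.Conv2d(in_ch, in_ch * t, 3, stride=stride,
                            padding=1, groups=in_ch, bias=False)
        self.bn = nn.BatchNorm2d(in_ch * t)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.dw(x)))

y = PokerExpand(16, t=4)(torch.randn(1, 16, 32, 32))   # -> (1, 64, 32, 32)
```

Since a depthwise 3x3 expansion costs roughly 9·C·t multiply-accumulates per pixel versus C²·t for the pointwise expansion it replaces, the substitution is cheaper whenever the channel count exceeds nine, and increasingly so as networks widen.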
Funding: Supported by the National Natural Science Foundation of China under Grant No. 61175007, the National Key Technologies R&D Program under Grant No. 2012BAH07B01, and the National Key Basic Research Program of China (973 Program) under Grant No. 2012CB316302.
Abstract: Crowd density estimation in wide areas is a challenging problem for visual surveillance. Because crowd situations can degenerate quickly, the safety of public events involving large crowds has always been a major concern. In this paper, we propose a video-based crowd density analysis and prediction system for wide-area surveillance applications. In monocular image sequences, the Accumulated Mosaic Image Difference (AMID) method is applied to extract crowd areas exhibiting irregular motion. From the density of these crowded areas, our system can adequately estimate the number of people and the velocity of a crowd. Using a multi-camera network, we can predict a crowd's density several minutes in advance. The system has been used in real applications, and numerous experiments conducted in real scenes (station, park, plaza) demonstrate the effectiveness and robustness of the proposed method.
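The abstract does not spell out the AMID update rule, but accumulated inter-frame differencing of this general kind can be sketched with OpenCV (the thresholds and decay rule below are illustrative assumptions, not the paper's):

```python
import cv2
import numpy as np

def accumulated_motion_mask(frames, diff_thresh=15, decay=0.9, keep_thresh=1.0):
    """Generic accumulated inter-frame differencing (illustrative of the
    AMID idea; parameter values are assumptions). Pixels that move
    persistently accumulate evidence over time and are kept as candidate
    crowd regions."""
    acc = np.zeros(frames[0].shape[:2], dtype=np.float32)
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    for frame in frames[1:]:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        moving = (cv2.absdiff(gray, prev) > diff_thresh).astype(np.float32)
        acc = decay * acc + moving   # decaying accumulation of motion evidence
        prev = gray
    return acc > keep_thresh         # binary mask of persistent-motion areas
```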
Funding: Supported in part by the Major Project for New Generation of AI (2018AAA0100400), the National Natural Science Foundation of China (61836014, U21B2042, 62072457, 62006231), and the InnoHK Program.
Abstract: Monocular 3D object detection is challenging due to the lack of accurate depth information. Some methods estimate pixel-wise depth maps with off-the-shelf depth estimators and then use them as an additional input to augment the RGB images. Such depth-based methods either convert the estimated depth maps to pseudo-LiDAR and apply LiDAR-based object detectors, or focus on fusion learning over images and depth. However, they demonstrate limited performance and efficiency as a result of depth inaccuracy and complex convolutional fusion modes. Different from these approaches, our proposed depth-guided vision transformer with normalizing flows (NF-DVT) network uses normalizing flows to build priors on depth maps and thereby obtain more accurate depth information. We then develop a novel Swin-Transformer-based backbone with a fusion module that processes RGB image patches and depth map patches in two separate branches and fuses them with cross-attention so that the branches exchange information. Furthermore, with the help of pixel-wise relative depth values in the depth maps, we develop new relative position embeddings in the cross-attention mechanism to capture more accurate sequence ordering of the input tokens. Our method is the first Swin-Transformer-based backbone architecture for monocular 3D object detection. Experimental results on the KITTI and the challenging Waymo Open datasets show the effectiveness of our proposed method and its superior performance over previous counterparts.
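The two-branch exchange can be sketched with standard attention primitives. The following is a simplified, hypothetical PyTorch illustration of cross-attention fusion between RGB and depth tokens; NF-DVT's actual module additionally carries the normalizing-flow depth prior and the depth-aware relative position embeddings described above:

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Hypothetical sketch of two-branch cross-attention exchange between
    RGB tokens and depth tokens (dimensions and head count are assumed)."""
    def __init__(self, dim=96, heads=4):
        super().__init__()
        self.rgb_from_depth = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.depth_from_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, rgb, depth):
        # Queries come from one branch; keys/values from the other, so
        # each modality pulls in complementary information.
        rgb_out, _ = self.rgb_from_depth(rgb, depth, depth)
        depth_out, _ = self.depth_from_rgb(depth, rgb, rgb)
        return rgb + rgb_out, depth + depth_out   # residual exchange

tokens = lambda: torch.randn(2, 196, 96)          # (batch, patches, dim)
fused_rgb, fused_depth = CrossModalFusion()(tokens(), tokens())
```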
Funding: Supported in part by the Beijing Natural Science Foundation, China (No. L221013), and in part by the National Science Foundation of China (Nos. U20B2070 and 61832016).
Abstract: Few-shot learning attempts to identify novel categories by exploiting limited labeled training data, yet the performance of existing methods still leaves much room for improvement. Because unlabeled data come at very low cost, many recent methods resort to additional unlabeled training data to boost performance, an approach known as semi-supervised few-shot learning (SSFSL). The general idea of SSFSL methods is to first generate pseudo labels for all unlabeled data and then augment the labeled training set with selected pseudo-labeled data. However, almost all previous SSFSL methods take their supervision signal only from pseudo-labeling, ignoring that the distribution of the training data can also serve as an effective unsupervised regularization. In this paper, we propose a simple yet effective SSFSL method, the feature reconstruction based regression method (TENET), which takes low-rank feature reconstruction as the unsupervised objective function and pseudo labels as the supervised constraint. We provide several theoretical insights into why TENET can mitigate overfitting on low-quality training data and why it enhances robustness against inaccurate pseudo labels. Extensive experiments on four popular datasets validate the effectiveness of TENET.
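A minimal sketch of such a combined objective, assuming a simple least-squares form for the low-rank reconstruction term (TENET's exact regression formulation and its constraint weighting differ), might look like this:

```python
import torch
import torch.nn.functional as F

def tenet_style_loss(features, logits, pseudo_labels, basis, lam=0.1):
    """Hypothetical combined objective: low-rank feature reconstruction as
    the unsupervised term plus pseudo-label cross-entropy as the supervised
    constraint. features: (N, D); basis: (R, D) with R << D; logits: (N, C)."""
    # Least-squares coefficients expressing each feature in the low-rank basis.
    coef = torch.linalg.lstsq(basis.T, features.T).solution   # (R, N)
    recon = (basis.T @ coef).T                                # (N, D)
    unsup = F.mse_loss(recon, features)           # distribution-aware regularizer
    sup = F.cross_entropy(logits, pseudo_labels)  # pseudo-label constraint
    return unsup + lam * sup
```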
Funding: Supported by the National Key Research and Development Program of China (No. 2021YFC3320103), the National Natural Science Foundation of China (NSFC) (Nos. 62372452 and 62272460), the Open Research Project of the State Key Laboratory of Media Convergence and Communication, Communication University of China (No. SKLMCC2022KF002), and the Youth Innovation Promotion Association CAS, China.
Abstract: Training generative adversarial networks is data-demanding, which limits the development of these models on target domains with inadequate training data. Recently, researchers have leveraged generative models pretrained on sufficient data and fine-tuned them on small training samples, thus reducing data requirements. However, because these methods lack an explicit focus on target styles and concentrate disproportionately on generative consistency, they perform poorly at diversity preservation, which reflects the adaptation ability of few-shot generative models. To mitigate this diversity degradation, we propose a framework with two key strategies: 1) to obtain more diverse styles from limited training data, we propose a cross-modal module that explicitly captures the target styles with a style prototype space and text-guided style instructions; and 2) to inherit the generation capability of the pretrained model, we constrain the similarity between the generated and source images with a structural discrepancy alignment module that maintains the structure correlation over multiscale areas. Extensive experiments and analyses demonstrate the effectiveness of our method, which outperforms state-of-the-art methods in mitigating diversity degradation.
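One common way to maintain structure correlation between source and adapted generations is to match the pairwise similarity structure across a batch; the sketch below illustrates that generic idea only and is not the paper's multiscale alignment module:

```python
import torch
import torch.nn.functional as F

def structure_correlation_loss(src_feats, gen_feats):
    """Generic batch-level structure preservation (illustrative assumption;
    the paper's structural discrepancy alignment works over multiscale
    areas). The pairwise similarity pattern among generated samples is
    pushed to match the pattern among the corresponding source samples."""
    def pairwise(f):
        f = F.normalize(f.flatten(1), dim=1)  # (B, *) -> unit vectors
        return f @ f.T                        # (B, B) similarity structure
    return F.mse_loss(pairwise(gen_feats), pairwise(src_feats))
```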
Funding: Supported in part by the National Science and Technology Major Project, China (No. 2022ZD0117901), and in part by the National Natural Science Foundation of China (Nos. 62373355, 62276256 and 62106260).
Abstract: Generalizable pedestrian attribute recognition (PAR) aims to learn a robust PAR model that can be directly adapted to unknown distributions under varying illumination, different viewpoints, and occlusions. This is an essential problem for real-world applications such as video surveillance and fashion search. In practice, when a trained PAR model is deployed in real-world scenarios, unseen target samples are fed into the model continuously in an online manner. This paper therefore proposes an efficient and flexible method, named AdaGPAR, for generalizable PAR (GPAR) via test-time adaptation (TTA), in which we adapt the trained model by exploiting the unlabeled target samples online during the test phase. To the best of our knowledge, this is the first work to solve GPAR from the perspective of TTA. In particular, the proposed AdaGPAR gradually memorizes reliable target sample pairs (features and pseudo-labels) as prototypes during the test phase. It then makes predictions with a non-parametric classifier by calculating the similarity between a target instance and the prototypes. However, since PAR is a multi-label classification task, using only the same holistic feature of one pedestrian image as the prototype for multiple attributes is suboptimal. An attribute localization branch is therefore introduced to extract attribute-specific features, and two kinds of memory banks are constructed to cache the global and attribute-specific features simultaneously. In summary, AdaGPAR is training-free in the test phase and predicts multiple pedestrian attributes of the target samples in an online manner, making it time-efficient and generalizable for real-world applications. Extensive experiments have been performed on the UPAR benchmark to compare the proposed method with multiple baselines. The superior performance demonstrates the effectiveness of AdaGPAR in improving the generalizability of a PAR model via TTA.
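The memory-plus-similarity prediction loop can be sketched compactly. The following is a hypothetical single-memory illustration (AdaGPAR additionally maintains attribute-specific memories fed by its localization branch; the confidence threshold and top-k values are assumptions):

```python
import torch
import torch.nn.functional as F

class PrototypeMemory:
    """Hypothetical sketch of memory-based, training-free prediction.
    Reliable test-time features and their multi-hot pseudo-labels are
    cached as prototypes; new samples are scored by similarity to them."""
    def __init__(self, max_size=512):
        self.feats, self.labels = [], []
        self.max_size = max_size

    def update(self, feat, pseudo_label, confidence, thresh=0.9):
        # Cache only confident pairs, echoing the abstract's "reliable" pairs.
        if confidence >= thresh and len(self.feats) < self.max_size:
            self.feats.append(F.normalize(feat, dim=0))
            self.labels.append(pseudo_label.float())

    def predict(self, feat, k=8):
        mem = torch.stack(self.feats)              # (M, D) prototype features
        lab = torch.stack(self.labels)             # (M, A) multi-hot labels
        sim = mem @ F.normalize(feat, dim=0)       # cosine similarity to memory
        w, idx = sim.topk(min(k, len(self.feats)))
        w = torch.softmax(w, dim=0)                # similarity-weighted vote
        return (w.unsqueeze(1) * lab[idx]).sum(0)  # per-attribute scores in [0, 1]
```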
Funding: Supported by the National Natural Science Foundation of China (Nos. 62276263, 62006225 and 62071468), the Strategic Priority Research Program of the Chinese Academy of Sciences (CAS), China (No. XDA27040700), and the National Key Research and Development Program of China (No. 2022YFC3310400).
Abstract: Skeleton-based action recognition has recently made significant progress. However, data imbalance remains a great challenge in real-world scenarios: the performance of current action recognition algorithms declines sharply when the training data suffer from heavy class imbalance. Imbalanced data degrade the learned representations and become the bottleneck for action recognition, so learning unbiased representations from imbalanced action data is the key to long-tailed action recognition. In this paper, we propose a novel balanced representation learning method to address the long-tailed problem in action recognition. First, a spatial-temporal action exploration strategy is presented to expand the sample space effectively, generating more valuable samples in a rebalanced manner. Second, we design a detached action-aware learning schedule to further mitigate bias in the representation space; the schedule detaches the representation learning of tail classes from training and introduces an action-aware loss to impose more effective constraints. Additionally, a skip-type representation is proposed to provide complementary structural information. The proposed method is validated on four skeleton datasets: NTU RGB+D 60, NTU RGB+D 120, NW-UCLA, and Kinetics. It not only achieves consistently large improvements over state-of-the-art (SOTA) methods but also demonstrates superior generalization capacity in extensive experiments. Our code is available at https://github.com/firework8/BRL.
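For context, the standard rebalancing baseline that such exploration strategies build on draws samples with inverse class-frequency weights; the sketch below shows that generic recipe, not the paper's spatial-temporal exploration strategy:

```python
import numpy as np

def rebalanced_sampling_weights(labels):
    """Generic class-rebalanced sampling (a standard long-tail recipe, not
    BRL's method). Each sample is drawn with probability inversely
    proportional to the size of its class, so tail-class actions appear
    about as often during training as head-class ones."""
    labels = np.asarray(labels)
    counts = np.bincount(labels)     # per-class sample counts
    weights = 1.0 / counts[labels]   # inverse-frequency weight per sample
    return weights / weights.sum()

# e.g. feed into torch.utils.data.WeightedRandomSampler(weights, len(weights))
```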
Funding: Jointly supported by the National Natural Science Foundation of China (Nos. 62236010, 62322607, 62276261 and 62076014), the Youth Innovation Promotion Association of the Chinese Academy of Sciences, China (No. 2021128), the Joint Fund of Natural Science of Hunan Province, China (No. 2023JJ50242), and the Key Projects of the Education Department of Hunan Province, China (No. 22A0115).
Abstract: Most vision-and-language navigation (VLN) research focuses on simulated environments, but applying these methods to real-world scenarios is challenging because misalignments between vision and language in complex environments lead to path deviations. To address this, we propose a novel vision-and-language object navigation strategy that uses multimodal pretrained knowledge as a cross-modal bridge to link semantic concepts in images and text, improving navigation supervision at key-points and enhancing robustness. Specifically, we 1) randomly generate key-points within a specific density range and optimize them on the basis of challenging locations; 2) use pretrained multimodal knowledge to efficiently retrieve target objects; 3) combine depth information with simultaneous localization and mapping (SLAM) map data to predict optimal positions and orientations for accurate navigation; and 4) implement the method on a physical robot, successfully conducting navigation tests. Our approach achieves a maximum success rate of 66.7%, outperforming existing VLN methods in real-world environments.
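Step 2 can be sketched with an off-the-shelf vision-language model; here CLIP stands in for whatever pretrained multimodal knowledge the paper actually uses, so treat the whole snippet as an illustrative assumption:

```python
import torch
import clip                    # assumes OpenAI's CLIP package is installed
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def score_views(view_paths, instruction):
    """Rank candidate camera views by how well they match a language goal
    (hypothetical stand-in for the paper's target-object retrieval step)."""
    images = torch.stack([preprocess(Image.open(p)) for p in view_paths]).to(device)
    text = clip.tokenize([instruction]).to(device)
    with torch.no_grad():
        img_f = model.encode_image(images)
        txt_f = model.encode_text(text)
        img_f = img_f / img_f.norm(dim=-1, keepdim=True)
        txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
    return (img_f @ txt_f.T).squeeze(1)   # one similarity score per view

# best_view = view_paths[int(score_views(view_paths, "the red chair").argmax())]
```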
Funding: Supported by the National Key R&D Program of China under Grant No. 2020AAA0106200 and the National Natural Science Foundation of China under Grant Nos. 61832016 and U20B2070.
Abstract: Transformers, the dominant architecture in natural language processing, have recently attracted much attention from computational visual media researchers due to their capacity for long-range representation and their high performance. Transformers are sequence-to-sequence models that use a self-attention mechanism rather than the sequential structure of RNNs; such models can therefore be trained in parallel and can represent global information. This study comprehensively surveys recent visual transformer works. We categorize them by task scenario, namely backbone design, high-level vision, low-level vision and generation, and multimodal learning, and we analyze their key ideas. Differing from previous surveys, we mainly focus on visual transformer methods for low-level vision and generation, and we also review the latest works on backbone design in detail. For ease of understanding, we precisely describe the main contributions of the latest works in the form of tables. As well as giving quantitative comparisons, we present image results for low-level vision and generation tasks. Computational costs and source code links for various important works are also given in this survey to assist further development.
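For readers new to the architecture, the parallel, global nature of self-attention that the survey contrasts with RNNs follows directly from its definition: every token scores every other token in a single matrix product, as this minimal single-head sketch shows.

```python
import torch
import torch.nn.functional as F

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention.
    x: (seq_len, d_model); wq, wk, wv: (d_model, d_k) projections."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / (k.shape[-1] ** 0.5)  # every token scores every token,
    attn = F.softmax(scores, dim=-1)         # so computation is parallel and
    return attn @ v                          # the receptive field is global

x = torch.randn(10, 64)                      # 10 tokens, model width 64
wq, wk, wv = (torch.randn(64, 32) for _ in range(3))
out = self_attention(x, wq, wk, wv)          # (10, 32)
```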