Funding: LHTD (20170003) and the Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ).
Abstract: Background: In recent decades, unmanned aerial vehicles (UAVs) have developed rapidly and been widely applied in many domains, including photography, reconstruction, monitoring, and search and rescue. In such applications, one key issue is path and view planning, which tells UAVs exactly where to fly and how to search. Methods: With specific consideration of three popular UAV applications (scene reconstruction, environment exploration, and aerial cinematography), we present a survey that should assist researchers in positioning and evaluating their work in the context of existing solutions. Results/Conclusions: It should also help newcomers and practitioners in related fields quickly gain an overview of the vast literature. In addition to the current research status, we analyze and elaborate on the advantages, disadvantages, and potential research trends for each application domain.
Funding: supported in part by the National Natural Science Foundation of China (U21B2023), the DEGP Innovation Team (2022KCXTD025), the Shenzhen Science and Technology Program (KQTD20210811090044003, RCJC20200714114435012), and the Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ).
Abstract: Exemplar-based image translation involves converting semantic masks into photorealistic images that adopt the style of a given exemplar. However, most existing GAN-based translation methods fail to produce photorealistic results. In this study, we propose a new diffusion model-based approach for generating high-quality images that are semantically aligned with the input mask and resemble an exemplar in style. The proposed method trains a conditional denoising diffusion probabilistic model (DDPM) with a SPADE module to integrate the semantic map. We then use a novel contextual loss and an auxiliary color loss to guide the optimization process, resulting in images that are visually pleasing and semantically accurate. Experiments demonstrate that our method outperforms state-of-the-art approaches in terms of both visual quality and quantitative metrics.
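For readers unfamiliar with SPADE-style conditioning, the sketch below (PyTorch) shows how a semantic map can be injected into a denoiser through spatially adaptive normalization. It is a minimal illustration only: the layer sizes, normalization choice, and usage shown are assumptions, not the configuration used in the paper.

```python
# Minimal SPADE-style normalization block, illustrating how a semantic map can
# modulate denoiser features; sizes are illustrative, not the paper's setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPADE(nn.Module):
    def __init__(self, feat_channels: int, label_channels: int, hidden: int = 128):
        super().__init__()
        # Parameter-free normalization; scale and shift come from the semantic map.
        self.norm = nn.InstanceNorm2d(feat_channels, affine=False)
        self.shared = nn.Sequential(
            nn.Conv2d(label_channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.to_gamma = nn.Conv2d(hidden, feat_channels, kernel_size=3, padding=1)
        self.to_beta = nn.Conv2d(hidden, feat_channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor, segmap: torch.Tensor) -> torch.Tensor:
        # Resize the semantic map to the feature resolution.
        segmap = F.interpolate(segmap, size=x.shape[-2:], mode="nearest")
        actv = self.shared(segmap)
        gamma, beta = self.to_gamma(actv), self.to_beta(actv)
        # Spatially varying denormalization conditioned on the semantic layout.
        return self.norm(x) * (1 + gamma) + beta

# Usage: modulate a 64-channel feature map with a 20-channel semantic map
# (one-hot in practice; random placeholders here).
feats = torch.randn(2, 64, 32, 32)
mask = torch.randn(2, 20, 256, 256)
out = SPADE(64, 20)(feats, mask)  # -> torch.Size([2, 64, 32, 32])
```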
Funding: supported in part by the National Natural Science Foundation of China (62161146005, U21B2023), the Shenzhen Science and Technology Program (KQTD20210811090044003, RCJC20200714114435012), and the Israel Science Foundation.
Abstract: This study introduces CLIP-Flow, a novel network for generating images from a given image or text. To effectively utilize the rich semantics contained in both modalities, we designed a semantics-guided methodology for image- and text-to-image synthesis. In particular, we adopted Contrastive Language-Image Pretraining (CLIP) as an encoder to extract semantics and StyleGAN as a decoder to generate images from such information. Moreover, to bridge the embedding space of CLIP and the latent space of StyleGAN, RealNVP is employed and modified with activation normalization and invertible convolution. As images and text share the same representation space in CLIP, text prompts can be fed directly into CLIP-Flow to achieve text-to-image synthesis. We conducted extensive experiments on several datasets to validate the effectiveness of the proposed image-to-image synthesis method. In addition, we tested on the public Multi-Modal CelebA-HQ dataset for text-to-image synthesis. Experiments validated that our approach can generate high-quality text-matching images and is comparable with state-of-the-art methods, both qualitatively and quantitatively.
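To illustrate how a normalizing flow can bridge two embedding spaces, the following minimal sketch (PyTorch) composes activation normalization, an invertible linear mixing step (playing the role of an invertible 1x1 convolution on a flat vector), and an affine coupling layer in the spirit of RealNVP. Dimensions, depth, and initialization are assumptions for illustration, not the paper's actual flow.

```python
# One RealNVP-style flow step mapping a 512-d embedding (e.g., from CLIP)
# toward another 512-d latent space (e.g., StyleGAN's); sizes are assumptions.
import torch
import torch.nn as nn

class ActNorm(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.log_scale = nn.Parameter(torch.zeros(dim))
        self.bias = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        # Per-dimension affine normalization; invertible by construction.
        return (x + self.bias) * torch.exp(self.log_scale)

class AffineCoupling(nn.Module):
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, x):
        xa, xb = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(xa).chunk(2, dim=1)
        yb = xb * torch.exp(torch.tanh(log_s)) + t   # invertible affine transform
        return torch.cat([xa, yb], dim=1)

class FlowStep(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.actnorm = ActNorm(dim)
        # Invertible linear mixing, analogous to an invertible 1x1 convolution.
        self.mix = nn.Linear(dim, dim, bias=False)
        nn.init.orthogonal_(self.mix.weight)
        self.coupling = AffineCoupling(dim)

    def forward(self, x):
        return self.coupling(self.mix(self.actnorm(x)))

clip_embedding = torch.randn(4, 512)       # placeholder CLIP image/text features
latent = FlowStep(512)(clip_embedding)     # one flow step toward the target latent space
```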
Funding: supported by the National Natural Science Foundation of China (Nos. 61572507, 61622212, and 61532003) and by the China Scholarship Council.
Abstract: Active vision is inherently attention-driven: an agent actively selects views to attend to in order to rapidly perform a vision task while improving its internal representation of the scene being observed. Inspired by the recent success of attention-based models in 2D vision tasks based on single RGB images, we address multi-view, depth-based active object recognition using an attention mechanism, by means of an end-to-end recurrent 3D attentional network. The architecture takes advantage of a recurrent neural network to store and update an internal representation. Our model, trained with 3D shape datasets, is able to iteratively attend to the best views of a target object in order to recognize it. To realize 3D view selection, we derive a 3D spatial transformer network. It is differentiable, allowing training with backpropagation and thus achieving much faster convergence than the reinforcement learning employed by most existing attention-based models. Experiments show that our method, with only depth input, achieves state-of-the-art next-best-view performance in terms of both time taken and recognition accuracy.
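The sketch below (PyTorch) illustrates the general idea of a recurrent, fully differentiable view-selection loop: encode the current depth view, update a recurrent state, and regress both a class prediction and the next viewpoint. The network sizes, the two-parameter view encoding, and the stand-in renderer are illustrative assumptions, not the paper's 3D spatial transformer architecture.

```python
# Minimal recurrent view-selection loop: every step is differentiable, so the
# whole loop can be trained with backpropagation instead of reinforcement learning.
import torch
import torch.nn as nn

class RecurrentViewSelector(nn.Module):
    def __init__(self, num_classes: int = 40, hidden: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(            # depth-image feature extractor
            nn.Conv2d(1, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(64 * 16, hidden),
        )
        self.rnn = nn.GRUCell(hidden, hidden)    # aggregates evidence across views
        self.classifier = nn.Linear(hidden, num_classes)
        self.next_view = nn.Linear(hidden, 2)    # (azimuth, elevation) offsets

    def forward(self, render_depth, init_view, steps: int = 3):
        """render_depth: callable mapping a (B, 2) view tensor to (B, 1, H, W) depths."""
        view = init_view
        h = torch.zeros(view.shape[0], self.rnn.hidden_size)
        for _ in range(steps):
            feat = self.encoder(render_depth(view))
            h = self.rnn(feat, h)
            view = view + self.next_view(h)      # differentiable next-view update
        return self.classifier(h)

# Stand-in renderer: in practice a differentiable depth renderer fills this role
# (the job done by the paper's 3D spatial transformer network).
fake_renderer = lambda v: torch.randn(v.shape[0], 1, 64, 64)
logits = RecurrentViewSelector()(fake_renderer, torch.zeros(4, 2))  # -> shape (4, 40)
```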