Funding: This work was supported by the National Natural Science Foundation of China (Nos. 62173045 and 61673192), the Fundamental Research Funds for the Central Universities (No. 2020XD-A04-2), and the BUPT Excellent PhD Students Foundation (No. CX2021222).
Abstract: Scene graphs of point clouds help to understand object-level relationships in 3D space. Most graph generation methods work on 2D structured data and cannot be applied to 3D unstructured point cloud data, while existing point-cloud-based methods generate the scene graph from an additional graph structure that requires labor-intensive manual annotation. To address these problems, we explore a method that converts point clouds into structured data and generates graphs without given structures. Specifically, we cluster points with similar augmented features into groups and establish relationships between them, resulting in an initial structural representation of the point cloud. In addition, we propose a Dynamic Graph Generation Network (DGGN) to judge the semantic labels of targets at different granularities. It dynamically splits and merges point groups, resulting in a scene graph with high precision. Experiments show that our method outperforms the baseline methods, outputting reliable graphs that describe object-level relationships without additional manually labeled data.
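The grouping step in this abstract can be pictured with a small sketch. The following is a hypothetical illustration, not the paper's code: it clusters points on assumed augmented features (position plus color) and relates groups whose centroids fall within a radius, giving the kind of initial structural representation described above. The feature choice, group count, and `link_radius` are all assumptions.

```python
# Hypothetical sketch of the initial grouping step: cluster points on
# assumed augmented features (xyz + color), then link groups whose
# centroids are closer than a radius.
import numpy as np
from sklearn.cluster import KMeans

def initial_groups(xyz, rgb, n_groups=16, link_radius=0.5):
    feats = np.hstack([xyz, rgb])            # assumed augmentation: xyz + color
    labels = KMeans(n_clusters=n_groups, n_init=10).fit_predict(feats)
    centers = np.stack([xyz[labels == g].mean(axis=0) for g in range(n_groups)])
    # Relate groups whose centroids are closer than link_radius.
    d = np.linalg.norm(centers[:, None] - centers[None], axis=-1)
    edges = [(i, j) for i in range(n_groups) for j in range(i + 1, n_groups)
             if d[i, j] < link_radius]
    return labels, edges

xyz = np.random.rand(2048, 3)                # toy point cloud
rgb = np.random.rand(2048, 3)
labels, edges = initial_groups(xyz, rgb)
print(len(edges), "candidate relationships between point groups")
```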
Funding: Supported by the National Natural Science Foundation of China (U21A20515, 62102393, 62206263, 62271467) and the Beijing Natural Science Foundation (4242053).
Abstract: Existing GAN-based generative methods are typically used for semantic image synthesis. We pose the question of whether GAN-based architectures can generate plausible depth maps, and find that existing methods have difficulty generating depth maps that reasonably represent 3D scene structure, due to the lack of global geometric correlations. Thus, we propose DepthGAN, a novel method that generates a depth map from a semantic layout, to aid the construction and manipulation of well-structured 3D scene point clouds. Specifically, we first build a feature generation model with a cascade of semantically aware transformer blocks to obtain depth features with global structural information. For the semantically aware transformer block, we propose a mixed attention module and a semantically aware layer normalization module to better exploit semantic consistency in depth feature generation. Moreover, we present a novel semantically weighted depth synthesis module, which generates adaptive depth intervals for the current scene; the final depth map is a weighted combination of semantically aware depth weights over the different depth ranges, yielding a more accurate depth map. Extensive experiments on indoor and outdoor datasets demonstrate that DepthGAN achieves superior results, both quantitatively and visually, on the depth generation task.
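One way to read the weighted depth synthesis is as a per-pixel combination over discretized depth intervals. The sketch below illustrates that reading only: the network would output per-pixel scores over K bins, and the depth map is the softmax-weighted sum of bin centers. The uniform bin placement and all dimensions are our assumptions; the paper's adaptive intervals and semantic conditioning are not reproduced.

```python
# Minimal numpy sketch of depth synthesis as a weighted combination over
# K depth intervals. Bin placement and shapes are illustrative assumptions.
import numpy as np

def synthesize_depth(logits, d_min=0.5, d_max=10.0):
    # logits: (H, W, K) raw per-pixel scores over K depth bins.
    K = logits.shape[-1]
    centers = np.linspace(d_min, d_max, K)       # assumed (non-adaptive) bins
    w = np.exp(logits - logits.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                # per-pixel softmax weights
    return (w * centers).sum(-1)                 # (H, W) depth map

depth = synthesize_depth(np.random.randn(64, 64, 32))
print(depth.shape, float(depth.min()), float(depth.max()))
```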
Funding: Supported by the International (Regional) Cooperation and Exchange Program of the National Natural Science Foundation of China under Grant No. 62161146002 and the Shenzhen Collaborative Innovation Program under Grant No. CJGJZD2021048092601003.
Abstract: We present SinGRAV, an attempt to learn a generative radiance volume from multi-view observations of a single natural scene, in stark contrast to existing category-level 3D generative models that learn from images of many object-centric scenes. Inspired by SinGAN, we likewise learn the internal distribution of the input scene, which necessitates our key designs with respect to the scene representation and network architecture. Unlike popular multi-layer perceptron (MLP)-based architectures, we employ convolutional generators and discriminators, which inherently possess a spatial locality bias, to operate over voxelized volumes and learn the internal distribution over a plethora of overlapping regions. On the other hand, localizing the adversarial generators and discriminators over confined areas with limited receptive fields easily leads to highly implausible geometric structures in space. Our remedy is to use a spatial inductive bias and joint discrimination on geometric cues in the form of 2D depth maps. This strategy effectively improves spatial arrangement while incurring negligible additional computational cost. Experimental results demonstrate the ability of SinGRAV to generate plausible and diverse variations from a single scene, its merits over state-of-the-art generative neural scene models, and its versatility in a variety of applications. Code and data will be released to facilitate further research.
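The joint discrimination on depth maps can be pictured as a patch discriminator that scores RGB renders and their depth maps together. The PyTorch sketch below is a hedged illustration under assumed channel counts and layer sizes, not SinGRAV's actual architecture: a small fully convolutional network with a limited receptive field scores 4-channel (RGB + depth) inputs patch-wise.

```python
# Hedged sketch of joint RGB + depth patch discrimination: a shallow
# fully-convolutional discriminator emits a per-patch real/fake score map.
import torch
import torch.nn as nn

class JointPatchD(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, ch, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(ch * 2, 1, 3, padding=1),  # patch-wise score map
        )

    def forward(self, rgb, depth):
        # Concatenate geometry (depth) with appearance (RGB) on channels.
        return self.net(torch.cat([rgb, depth], dim=1))

d = JointPatchD()
scores = d(torch.rand(1, 3, 64, 64), torch.rand(1, 1, 64, 64))
print(scores.shape)  # e.g. (1, 1, 16, 16): one score per local patch
```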
Funding: Supported by the National Natural Science Foundation of China (62322210), the Beijing Municipal Natural Science Foundation for Distinguished Young Scholars (JQ21013), the Beijing Municipal Science and Technology Commission (Z231100005923031), and the 2023 Tencent AI Lab Rhino-Bird Focused Research Program.
Abstract: The emergence of 3D Gaussian splatting (3DGS) has greatly accelerated rendering for novel view synthesis. Unlike neural implicit representations such as neural radiance fields (NeRFs), which represent a 3D scene with position- and viewpoint-conditioned neural networks, 3D Gaussian splatting models the scene with a set of Gaussian ellipsoids, so that efficient rendering can be accomplished by rasterizing the ellipsoids into images. Apart from fast rendering, the explicit representation also facilitates downstream tasks such as dynamic reconstruction, geometry editing, and physical simulation. Considering the rapid changes and the growing number of works in this field, we present a literature review of recent 3D Gaussian splatting methods, which can be roughly classified by functionality into 3D reconstruction, 3D editing, and other downstream applications. Traditional point-based rendering methods and the rendering formulation of 3D Gaussian splatting are also covered to aid understanding of the technique. This survey aims to help beginners get started in this field quickly, to provide experienced researchers with a comprehensive overview, and to stimulate future development of the 3D Gaussian splatting representation.
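For reference, the rendering formulation mentioned here is the standard front-to-back alpha blending shared by point-based rendering and 3DGS rasterization: the color of a pixel mixes the depth-sorted Gaussians that overlap it.

```latex
% Front-to-back alpha blending over the N depth-sorted Gaussians
% overlapping a pixel; c_i is the i-th Gaussian's color and alpha_i its
% opacity modulated by the projected 2D Gaussian evaluated at the pixel.
\[
C \;=\; \sum_{i \in N} c_i \,\alpha_i \prod_{j=1}^{i-1} \bigl(1 - \alpha_j\bigr)
\]
```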
Funding: Supported by the Key Technological Innovation Projects of Hubei Province of China under Grant No. 2018AAA062, the Wuhan Science and Technology Plan Project of Hubei Province of China under Grant No. 2017010201010109, the National Key Research and Development Program of China under Grant No. 2017YFB1002600, and the National Natural Science Foundation of China under Grant Nos. 61672390 and 61972298.
Abstract: Synthesizing a complex scene image with multiple objects and a background from a text description is a challenging problem that spans several difficult tasks in natural language processing and computer vision. We model it as a combination of semantic entity recognition, object retrieval and recombination, and object-status optimization, and propose a comprehensive pipeline to convert the input text to its visual counterpart. The pipeline includes text processing, foreground object and background scene retrieval, image synthesis using constrained MCMC, and post-processing. First, we roughly divide the objects parsed from the input text into foreground objects and background scenes. Second, we retrieve the required foreground objects from a foreground object dataset segmented from the Microsoft COCO dataset, and retrieve an appropriate background scene image from a background image dataset collected from the Internet. Third, to ensure the rationality of the foreground objects' positions and sizes in the image synthesis step, we design a cost function and use the Markov chain Monte Carlo (MCMC) method as the optimizer for this constrained layout problem. Finally, to make the image look natural and harmonious, we use Poisson-based and relighting-based methods to blend the foreground objects with the background scene image in the post-processing step. Synthesis and comparison results on the Microsoft COCO dataset show that our method outperforms several state-of-the-art methods based on generative adversarial networks (GANs) in the visual quality of the generated scene images.
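The constrained layout optimization in the third step can be sketched with a toy Metropolis-style loop: perturb object positions and sizes, and accept moves by the usual MCMC rule on a cost function. The cost terms below (pairwise overlap plus a centering prior), the proposal scale, and the temperature are stand-ins, not the paper's actual objective.

```python
# Toy Metropolis-style layout optimization: rows of `layout` are assumed
# (x, y, scale) triples; the cost penalizes object overlap and drift from
# the canvas center as stand-in constraints.
import numpy as np

rng = np.random.default_rng(0)

def cost(layout):
    xy, s = layout[:, :2], layout[:, 2]
    d = np.linalg.norm(xy[:, None] - xy[None], axis=-1)
    overlap = np.clip(s[:, None] + s[None] - d, 0, None)  # circle overlap proxy
    np.fill_diagonal(overlap, 0.0)
    return overlap.sum() + 0.1 * np.abs(xy - 0.5).sum()

layout = rng.random((4, 3)) * [1.0, 1.0, 0.3]   # 4 objects on a unit canvas
c = cost(layout)
for _ in range(2000):
    cand = layout + rng.normal(scale=0.02, size=layout.shape)  # small proposal
    c2 = cost(cand)
    if c2 < c or rng.random() < np.exp((c - c2) / 0.05):       # MCMC accept rule
        layout, c = cand, c2
print("final layout cost:", round(c, 4))
```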
Funding: Supported by the National Natural Science Foundation of China (Grant No. U20B2069) and the Fundamental Research Funds for the Central Universities.
Abstract: Learning the interactions between small-group activities is a key step in understanding team sports videos. Most recent research on team sports videos takes the perspective of the audience rather than that of the athlete. For team sports videos such as volleyball and basketball videos, there are plenty of intra-team and inter-team relations. In this paper, a new task named Group Scene Graph Generation is introduced to better understand intra-team and inter-team relations in sports videos. To tackle this problem, a novel Hierarchical Relation Network is proposed. After all players in a video are divided into two teams, the features of the two teams' activities and interactions are enhanced by graph convolutional networks and finally recognized to generate the group scene graph. For evaluation, a Volleyball+ dataset is proposed, built on the Volleyball dataset with 9660 additional team activity labels. A baseline is set for comparison, and our experimental results demonstrate the effectiveness of our method. Moreover, the idea behind our method can be directly applied to another video-based task, Group Activity Recognition; experiments show the superiority of our method and reveal the link between the two tasks. Finally, from the athlete's view, we present an interpretation that shows how the group scene graph can be used to analyze team activities and provide professional game suggestions.
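The feature enhancement by graph convolutional networks can be illustrated with a minimal layer: per-player features are aggregated over a row-normalized relation graph with self-loops before team-level recognition. The adjacency construction and dimensions below are assumptions for illustration, not the paper's network.

```python
# Minimal GCN layer sketch: H = ReLU(A_hat X W) with self-loops and
# row normalization; the toy relation graph and sizes are assumptions.
import numpy as np

def gcn_layer(X, A, W):
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    A_hat /= A_hat.sum(1, keepdims=True)      # row-normalize the adjacency
    return np.maximum(A_hat @ X @ W, 0.0)     # ReLU(A_hat X W)

X = np.random.rand(12, 64)                        # 12 players, 64-d features
A = (np.random.rand(12, 12) > 0.7).astype(float)  # toy relation graph
H = gcn_layer(X, A, np.random.randn(64, 32) * 0.1)
print(H.shape)  # enhanced per-player features, (12, 32)
```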
Funding: Supported by the National Natural Science Foundation of China (No. 62072232), the Fundamental Research Funds for the Central Universities (No. 021714380026), and the Collaborative Innovation Center of Novel Software Technology and Industrialization.
Abstract: Social relationships, such as parent-offspring and friendship, are crucial and stable connections between individuals, especially at the person level, and are essential for accurately describing the semantics of videos. In this paper, we analogize this task to scene graph generation and call it video social relationship graph generation (VSRGG): generating a social relationship graph for each video based on person-level relationships. We propose a context-aware graph neural network (CAGNet) for VSRGG, which effectively generates social relationship graphs through message passing, capturing the context of the video. Specifically, CAGNet detects the persons in the video, generates an initial graph via relationship proposal, and extracts facial and body features to describe the detected individuals, as well as temporal features to describe their interactions. Then, CAGNet predicts pairwise relationships between individuals using graph message passing. Additionally, we construct a new dataset, VidSoR, to evaluate VSRGG; it contains 72 hours of video with 6276 person instances and 5313 relationship instances of eight relationship types. Extensive experiments show that CAGNet makes accurate predictions with a comparatively high mean recall (mRecall) using only visual features.
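Pairwise prediction via graph message passing can be sketched as one round of neighbor aggregation followed by classifying each person pair from the concatenated node states. The update rule, dimensions, and fully connected initial graph below are illustrative assumptions, not CAGNet's implementation.

```python
# Hedged sketch of one message-passing round plus pairwise relation
# scoring; features, weights, and the proposal graph are toy stand-ins.
import numpy as np

rng = np.random.default_rng(1)
N, D, R = 5, 32, 8                      # persons, feature dim, relation types
X = rng.random((N, D))                  # fused face/body/temporal features
A = np.ones((N, N)) - np.eye(N)         # assumed fully linked proposal graph

W_msg = rng.normal(scale=0.1, size=(D, D))
W_rel = rng.normal(scale=0.1, size=(2 * D, R))

M = (A / A.sum(1, keepdims=True)) @ X @ W_msg   # mean-aggregate messages
H = np.tanh(X + M)                              # updated person states
pairs = np.concatenate([H[:, None].repeat(N, 1),
                        H[None].repeat(N, 0)], axis=-1)  # (N, N, 2D)
logits = pairs @ W_rel                          # (N, N, R) relation scores
print(logits.argmax(-1))                        # predicted type per pair
```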