Research on neural radiance fields for novel view synthesis has experienced explosive growth with the development of new models and extensions.The NeRF(Neural Radiance Fields)algorithm,suitable for underwater scenes o...Research on neural radiance fields for novel view synthesis has experienced explosive growth with the development of new models and extensions.The NeRF(Neural Radiance Fields)algorithm,suitable for underwater scenes or scattering media,is also evolving.Existing underwater 3D reconstruction systems still face challenges such as long training times and low rendering efficiency.This paper proposes an improved underwater 3D reconstruction system to achieve rapid and high-quality 3D reconstruction.First,we enhance underwater videos captured by a monocular camera to correct the image quality degradation caused by the physical properties of the water medium and ensure consistency in enhancement across frames.Then,we perform keyframe selection to optimize resource usage and reduce the impact of dynamic objects on the reconstruction results.After pose estimation using COLMAP,the selected keyframes undergo 3D reconstruction using neural radiance fields(NeRF)based on multi-resolution hash encoding for model construction and rendering.In terms of image enhancement,our method has been optimized in certain scenarios,demonstrating effectiveness in image enhancement and better continuity between consecutive frames of the same data.In terms of 3D reconstruction,our method achieved a peak signal-to-noise ratio(PSNR)of 18.40 dB and a structural similarity(SSIM)of 0.6677,indicating a good balance between operational efficiency and reconstruction quality.展开更多
This paper presents a method for structured scene modeling using micro stereo vision system with large field of view. The proposed algorithm includes edge detection with Canny detector, line fitting with principle axi...This paper presents a method for structured scene modeling using micro stereo vision system with large field of view. The proposed algorithm includes edge detection with Canny detector, line fitting with principle axis based approach, finding corresponding lines using feature based matching method, and 3D line depth computation.展开更多
In dynamic scenarios,visual simultaneous localization and mapping(SLAM)algorithms often incorrectly incorporate dynamic points during camera pose computation,leading to reduced accuracy and robustness.This paper prese...In dynamic scenarios,visual simultaneous localization and mapping(SLAM)algorithms often incorrectly incorporate dynamic points during camera pose computation,leading to reduced accuracy and robustness.This paper presents a dynamic SLAM algorithm that leverages object detection and regional dynamic probability.Firstly,a parallel thread employs the YOLOX object detectionmodel to gather 2D semantic information and compensate for missed detections.Next,an improved K-means++clustering algorithm clusters bounding box regions,adaptively determining the threshold for extracting dynamic object contours as dynamic points change.This process divides the image into low dynamic,suspicious dynamic,and high dynamic regions.In the tracking thread,the dynamic point removal module assigns dynamic probability weights to the feature points in these regions.Combined with geometric methods,it detects and removes the dynamic points.The final evaluation on the public TUM RGB-D dataset shows that the proposed dynamic SLAM algorithm surpasses most existing SLAM algorithms,providing better pose estimation accuracy and robustness in dynamic environments.展开更多
With the upgrading of tourism consumption patterns,the traditional renovation models of waterfront recreational spaces centered on landscape design can no longer meet the commercial and humanistic demands of modern cu...With the upgrading of tourism consumption patterns,the traditional renovation models of waterfront recreational spaces centered on landscape design can no longer meet the commercial and humanistic demands of modern cultural and tourism development.Based on scene theory as the analytical framework and taking the Xuan en Night Banquet Project in Enshi as a case study,this paper explores the design pathway for transforming waterfront areas in tourism cities from"spatial reconstruction"to"scene construction".The study argues that waterfront space renewal should transcend mere physical renovation.By implementing three core strategies:spatial narrative framework,ecological industry creation,and cultural empowerment,it is possible to construct integrated scenarios that blend cultural value,consumption spaces,and lifestyle elements.This approach ultimately fosters sustained vitality in waterfront areas and promotes the high-quality development of cultural and tourism industry.展开更多
Crime scene investigation(CSI)is an important link in the criminal justice system as it serves as a bridge between establishing the happenings during an incident and possibly identifying the accountable persons,provid...Crime scene investigation(CSI)is an important link in the criminal justice system as it serves as a bridge between establishing the happenings during an incident and possibly identifying the accountable persons,providing light in the dark.The International Organization for Standardization(ISO)and the International Electrotechnical Commission(IEC)collaborated to develop the ISO/IEC 17020:2012 standard to govern the quality of CSI,a branch of inspection activity.These protocols include the impartiality and competence of the crime scene investigators involved,contemporary recording of scene observations and data obtained,the correct use of resources during scene processing,forensic evidence collection and handling procedures,and the confidentiality and integrity of any scene information obtained from other parties etc.The preparatory work,the accreditation processes involved and the implementation of new quality measures to the existing quality management system in order to achieve the ISO/IE 17020:2012 accreditation at the Forensic Science Division of the Government Laboratory in Hong Kong are discussed in this paper.展开更多
Scene graph prediction has emerged as a critical task in computer vision,focusing on transforming complex visual scenes into structured representations by identifying objects,their attributes,and the relationships amo...Scene graph prediction has emerged as a critical task in computer vision,focusing on transforming complex visual scenes into structured representations by identifying objects,their attributes,and the relationships among them.Extending this to 3D semantic scene graph(3DSSG)prediction introduces an additional layer of complexity because it requires the processing of point-cloud data to accurately capture the spatial and volumetric characteristics of a scene.A significant challenge in 3DSSG is the long-tailed distribution of object and relationship labels,causing certain classes to be severely underrepresented and suboptimal performance in these rare categories.To address this,we proposed a fusion prototypical network(FPN),which combines the strengths of conventional neural networks for 3DSSG with a Prototypical Network.The former are known for their ability to handle complex scene graph predictions while the latter excels in few-shot learning scenarios.By leveraging this fusion,our approach enhances the overall prediction accuracy and substantially improves the handling of underrepresented labels.Through extensive experiments using the 3DSSG dataset,we demonstrated that the FPN achieves state-of-the-art performance in 3D scene graph prediction as a single model and effectively mitigates the impact of the long-tailed distribution,providing a more balanced and comprehensive understanding of complex 3D environments.展开更多
Remote sensing scene image classification is a prominent research area within remote sensing.Deep learningbased methods have been extensively utilized and have shown significant advancements in this field.Recent progr...Remote sensing scene image classification is a prominent research area within remote sensing.Deep learningbased methods have been extensively utilized and have shown significant advancements in this field.Recent progress in these methods primarily focuses on enhancing feature representation capabilities to improve performance.The challenge lies in the limited spatial resolution of small-sized remote sensing images,as well as image blurring and sparse data.These factors contribute to lower accuracy in current deep learning models.Additionally,deeper networks with attention-based modules require a substantial number of network parameters,leading to high computational costs and memory usage.In this article,we introduce ERSNet,a lightweight novel attention-guided network for remote sensing scene image classification.ERSNet is constructed using a deep separable convolutional network and incorporates an attention mechanism.It utilizes spatial attention,channel attention,and channel self-attention to enhance feature representation and accuracy,while also reducing computational complexity and memory usage.Experimental results indicate that,compared to existing state-of-the-art methods,ERSNet has a significantly lower parameter count of only 1.2 M and reduced Flops.It achieves the highest classification accuracy of 99.14%on the EuroSAT dataset,demonstrating its suitability for application on mobile terminal devices.Furthermore,experimental results from the UCMerced land use dataset and the Brazilian coffee scene also confirm the strong generalization ability of this method.展开更多
This paper presents a comprehensive framework that enables communication scene recognition through deep learning and multi-sensor fusion.This study aims to address the challenge of current communication scene recognit...This paper presents a comprehensive framework that enables communication scene recognition through deep learning and multi-sensor fusion.This study aims to address the challenge of current communication scene recognition methods that struggle to adapt in dynamic environments,as they typically rely on post-response mechanisms that fail to detect scene changes before users experience latency.The proposed framework leverages data from multiple smartphone sensors,including acceleration sensors,gyroscopes,magnetic field sensors,and orientation sensors,to identify different communication scenes,such as walking,running,cycling,and various modes of transportation.Extensive experimental comparative analysis with existing methods on the open-source SHL-2018 dataset confirmed the superior performance of our approach in terms of F1 score and processing speed.Additionally,tests using a Microsoft Surface Pro tablet and a self-collected Beijing-2023 dataset have validated the framework's efficiency and generalization capability.The results show that our framework achieved an F1 score of 95.15%on SHL-2018and 94.6%on Beijing-2023,highlighting its robustness across different datasets and conditions.Furthermore,the levels of computational complexity and power consumption associated with the algorithm are moderate,making it suitable for deployment on mobile devices.展开更多
Recognizing road scene context from a single image remains a critical challenge for intelligent autonomous driving systems,particularly in dynamic and unstructured environments.While recent advancements in deep learni...Recognizing road scene context from a single image remains a critical challenge for intelligent autonomous driving systems,particularly in dynamic and unstructured environments.While recent advancements in deep learning have significantly enhanced road scene classification,simultaneously achieving high accuracy,computational efficiency,and adaptability across diverse conditions continues to be difficult.To address these challenges,this study proposes HybridLSTM,a novel and efficient framework that integrates deep learning-based,object-based,and handcrafted feature extraction methods within a unified architecture.HybridLSTM is designed to classify four distinct road scene categories—crosswalk(CW),highway(HW),overpass/tunnel(OP/T),and parking(P)—by leveraging multiple publicly available datasets,including Places-365,BDD100K,LabelMe,and KITTI,thereby promoting domain generalization.The framework fuses object-level features extracted using YOLOv5 and VGG19,scene-level global representations obtained from a modified VGG19,and fine-grained texture features captured through eight handcrafted descriptors.This hybrid feature fusion enables the model to capture both semantic context and low-level visual cues,which are critical for robust scene understanding.To model spatial arrangements and latent sequential dependencies present even in static imagery,the combined features are processed through a Long Short-Term Memory(LSTM)network,allowing the extraction of discriminative patterns across heterogeneous feature spaces.Extensive experiments conducted on 2725 annotated road scene images,with an 80:20 training-to-testing split,validate the effectiveness of the proposed model.HybridLSTM achieves a classification accuracy of 96.3%,a precision of 95.8%,a recall of 96.1%,and an F1-score of 96.0%,outperforming several existing state-of-the-art methods.These results demonstrate the robustness,scalability,and generalization capability of HybridLSTM across varying environments and scene complexities.Moreover,the framework is optimized to balance classification performance with computational efficiency,making it highly suitable for real-time deployment in embedded autonomous driving systems.Future work will focus on extending the model to multi-class detection within a single frame and optimizing it further for edge-device deployments to reduce computational overhead in practical applications.展开更多
The autonomous landing guidance of fixed-wing aircraft in unknown structured scenes presents a substantial technological challenge,particularly regarding the effectiveness of solutions for monocular visual relative po...The autonomous landing guidance of fixed-wing aircraft in unknown structured scenes presents a substantial technological challenge,particularly regarding the effectiveness of solutions for monocular visual relative pose estimation.This study proposes a novel airborne monocular visual estimation method based on structured scene features to address this challenge.First,a multitask neural network model is established for segmentation,depth estimation,and slope estimation on monocular images.And a monocular image comprehensive three-dimensional information metric is designed,encompassing length,span,flatness,and slope information.Subsequently,structured edge features are leveraged to filter candidate landing regions adaptively.By leveraging the three-dimensional information metric,the optimal landing region is accurately and efficiently identified.Finally,sparse two-dimensional key point is used to parameterize the optimal landing region for the first time and a high-precision relative pose estimation is achieved.Additional measurement information is introduced to provide the autonomous landing guidance information between the aircraft and the optimal landing region.Experimental results obtained from both synthetic and real data demonstrate the effectiveness of the proposed method in monocular pose estimation for autonomous aircraft landing guidance in unknown structured scenes.展开更多
In the dynamic scene of autonomous vehicles,the depth estimation of monocular cameras often faces the problem of inaccurate edge depth estimation.To solve this problem,we propose an unsupervised monocular depth estima...In the dynamic scene of autonomous vehicles,the depth estimation of monocular cameras often faces the problem of inaccurate edge depth estimation.To solve this problem,we propose an unsupervised monocular depth estimation model based on edge enhancement,which is specifically aimed at the depth perception challenge in dynamic scenes.The model consists of two core networks:a deep prediction network and a motion estimation network,both of which adopt an encoder-decoder architecture.The depth prediction network is based on the U-Net structure of ResNet18,which is responsible for generating the depth map of the scene.The motion estimation network is based on the U-Net structure of Flow-Net,focusing on the motion estimation of dynamic targets.In the decoding stage of the motion estimation network,we innovatively introduce an edge-enhanced decoder,which integrates a convolutional block attention module(CBAM)in the decoding process to enhance the recognition ability of the edge features of moving objects.In addition,we also designed a strip convolution module to improve the model’s capture efficiency of discrete moving targets.To further improve the performance of the model,we propose a novel edge regularization method based on the Laplace operator,which effectively accelerates the convergence process of themodel.Experimental results on the KITTI and Cityscapes datasets show that compared with the current advanced dynamic unsupervised monocular model,the proposed model has a significant improvement in depth estimation accuracy and convergence speed.Specifically,the rootmean square error(RMSE)is reduced by 4.8%compared with the DepthMotion algorithm,while the training convergence speed is increased by 36%,which shows the superior performance of the model in the depth estimation task in dynamic scenes.展开更多
Semantic segmentation in street scenes is a crucial technology for autonomous driving to analyze the surrounding environment.In street scenes,issues such as high image resolution caused by a large viewpoints and diffe...Semantic segmentation in street scenes is a crucial technology for autonomous driving to analyze the surrounding environment.In street scenes,issues such as high image resolution caused by a large viewpoints and differences in object scales lead to a decline in real-time performance and difficulties in multi-scale feature extraction.To address this,we propose a bilateral-branch real-time semantic segmentationmethod based on semantic information distillation(BSDNet)for street scene images.The BSDNet consists of a Feature Conversion Convolutional Block(FCB),a Semantic Information Distillation Module(SIDM),and a Deep Aggregation Atrous Convolution Pyramid Pooling(DASP).FCB reduces the semantic gap between the backbone and the semantic branch.SIDM extracts high-quality semantic information fromthe Transformer branch to reduce computational costs.DASP aggregates information lost in atrous convolutions,effectively capturingmulti-scale objects.Extensive experiments conducted on Cityscapes,CamVid,and ADE20K,achieving an accuracy of 81.7% Mean Intersection over Union(mIoU)at 70.6 Frames Per Second(FPS)on Cityscapes,demonstrate that our method achieves a better balance between accuracy and inference speed.展开更多
Today,autonomous mobile robots are widely used in all walks of life.Autonomous navigation,as a basic capability of robots,has become a research hotspot.Classical navigation techniques,which rely on pre-built maps,stru...Today,autonomous mobile robots are widely used in all walks of life.Autonomous navigation,as a basic capability of robots,has become a research hotspot.Classical navigation techniques,which rely on pre-built maps,struggle to cope with complex and dynamic environments.With the development of artificial intelligence,learning-based navigation technology have emerged.Instead of relying on pre-built maps,the agent perceives the environment and make decisions through visual observation,enabling end-to-end navigation.A key challenge is to enhance the generalization ability of the agent in unfamiliar environments.To tackle this challenge,it is necessary to endow the agent with spatial intelligence.Spatial intelligence refers to the ability of the agent to transform visual observations into insights,in-sights into understanding,and understanding into actions.To endow the agent with spatial intelligence,relevant research uses scene graph to represent the environment.We refer to this method as scene graph-based object goal navigation.In this paper,we concentrate on scene graph,offering formal description,computational framework of object goal navigation.We provide a comprehensive summary of the meth-ods for constructing and applying scene graph.Additionally,we present experimental evidence that highlights the critical role of scene graph in improving navigation success.This paper also delineates promising research directions,all aimed at sharpening the focus on scene graph.Overall,this paper shows how scene graph endows the agent with spatial intelligence,aiming to promote the importance of scene graph in the field of intelligent navigation.展开更多
Self-supervised monocular depth estimation has emerged as a major research focus in recent years,primarily due to the elimination of ground-truth depth dependence.However,the prevailing architectures in this domain su...Self-supervised monocular depth estimation has emerged as a major research focus in recent years,primarily due to the elimination of ground-truth depth dependence.However,the prevailing architectures in this domain suffer from inherent limitations:existing pose network branches infer camera ego-motion exclusively under static-scene and Lambertian-surface assumptions.These assumptions are often violated in real-world scenarios due to dynamic objects,non-Lambertian reflectance,and unstructured background elements,leading to pervasive artifacts such as depth discontinuities(“holes”),structural collapse,and ambiguous reconstruction.To address these challenges,we propose a novel framework that integrates scene dynamic pose estimation into the conventional self-supervised depth network,enhancing its ability to model complex scene dynamics.Our contributions are threefold:(1)a pixel-wise dynamic pose estimation module that jointly resolves the pose transformations of moving objects and localized scene perturbations;(2)a physically-informed loss function that couples dynamic pose and depth predictions,designed to mitigate depth errors arising from high-speed distant objects and geometrically inconsistent motion profiles;(3)an efficient SE(3)transformation parameterization that streamlines network complexity and temporal pre-processing.Extensive experiments on the KITTI and NYU-V2 benchmarks show that our framework achieves state-of-the-art performance in both quantitative metrics and qualitative visual fidelity,significantly improving the robustness and generalization of monocular depth estimation under dynamic conditions.展开更多
Video action recognition(VAR)aims to analyze dynamic behaviors in videos and achieve semantic understanding.VAR faces challenges such as temporal dynamics,action-scene coupling,and the complexity of human interactions...Video action recognition(VAR)aims to analyze dynamic behaviors in videos and achieve semantic understanding.VAR faces challenges such as temporal dynamics,action-scene coupling,and the complexity of human interactions.Existing methods can be categorized into motion-level,event-level,and story-level ones based on spatiotemporal granularity.However,single-modal approaches struggle to capture complex behavioral semantics and human factors.Therefore,in recent years,vision-language models(VLMs)have been introduced into this field,providing new research perspectives for VAR.In this paper,we systematically review spatiotemporal hierarchical methods in VAR and explore how the introduction of large models has advanced the field.Additionally,we propose the concept of“Factor”to identify and integrate key information from both visual and textual modalities,enhancing multimodal alignment.We also summarize various multimodal alignment methods and provide in-depth analysis and insights into future research directions.展开更多
As eye tracking can be used to record moment-to-moment changes of eye movements as people inspect pictures of natural scenes and comprehend information, this paper attempts to use eye-movement technology to investigat...As eye tracking can be used to record moment-to-moment changes of eye movements as people inspect pictures of natural scenes and comprehend information, this paper attempts to use eye-movement technology to investigate how the order of presentation and the characteristics of information affect the semantic mismatch effect in the picture-sentence paradigm. A 3(syntax)×2(semantic relation) factorial design is adopted, with syntax and semantic relations as within-participant variables. The experiment finds that the semantic mismatch is most likely to increase cognitive loads as people have to spend more time, including first-pass time, regression path duration, and total fixation duration. Double negation does not significantly increase the processing difficulty of pictures and information. Experimental results show that people can extract the special syntactic strategy from long-term memory to process pictures and sentences with different semantic relations. It enables readers to comprehend double negation as affirmation. These results demonstrate that the constituent comparison model may not be a general model regarding other languages.展开更多
[Objective] The aim was to establish a model based on spatial scene similarity, for which soil, slope, transport, water conservancy, light, social economic factors in suitable planting areas were all considered. A new...[Objective] The aim was to establish a model based on spatial scene similarity, for which soil, slope, transport, water conservancy, light, social economic factors in suitable planting areas were all considered. A new suitable planting area of flue-cured tobacco was determined by comparison and analysis, with consideration of excellent area. [Method] Totaling thirty natural factors were chosen, which were clas- sified into nine categories, from Longpeng Town (LP) and Shaochong Town (SC) in Shiping County in Honghe Hani and Yi Autonomous Prefecture. [Result] According to weights, the factors from high to low were as follows: soil〉light〉elevation〉slope〉 water conservancy〉transport〉baking facility〉planting plans over the years〉others. The similarity of geographical conditions in the area was 0.894 3, which indicated that the planting conditions in the two regions are similar. If farmer population in unit area, farmland quantity for individual farmer, labors in every household, activity in planting flue-cured tobacco and work of local instructor were considered, the weights of different factors were as follows: farmer population in unit area〉farmland quantity for individual farmer〉farmers' activity in planting flue-cured tobacco〉educational back- ground〉labor force in every household〉instructor〉population of farmers' children at- tending school. The similarity of geographical conditions was 0.703 1, which indicated that it is none-natural factors that influence yield and quality of flue-cured tobacco. [Conclusion] According to analysis on suitable planting area of flue-cured tobacco based on assessment of spatial scene similarity, similarity of growing conditions in two spatial scenes can be analyzed and evaluated, which would promote further exploration on, influencing factors and effects on tobacco production.展开更多
A method of multi-block Single Shot Multi Box Detector(SSD)based on small object detection is proposed to the railway scene of unmanned aerial vehicle surveillance.To address the limitation of small object detection,a...A method of multi-block Single Shot Multi Box Detector(SSD)based on small object detection is proposed to the railway scene of unmanned aerial vehicle surveillance.To address the limitation of small object detection,a multi-block SSD mechanism,which consists of three steps,is designed.First,the original input images are segmented into several overlapped patches.Second,each patch is separately fed into an SSD to detect the objects.Third,the patches are merged together through two stages.In the first stage,the truncated object of the sub-layer detection result is spliced.In the second stage,a sub-layer suppression and filtering algorithm applying the concept of non-maximum suppression is utilized to remove the overlapped boxes of sub-layers.The boxes that are not detected in the main-layer are retained.In addition,no sufficient labeled training samples of railway circumstance are available,thereby hindering the deployment of SSD.A two-stage training strategy leveraging to transfer learning is adopted to solve this issue.The deep learning model is preliminarily trained using labeled data of numerous auxiliaries,and then it is refined using only a few samples of railway scene.A railway spot in China,which is easily damaged by landslides,is investigated as a case study.Experimental results show that the proposed multi-block SSD method produces an overall accuracy of 96.6%and obtains an improvement of up to 9.2%compared with the traditional SSD.展开更多
Realizing autonomy is a hot research topic for automatic vehicles in recent years. For a long time, most of the efforts to this goal concentrate on understanding the scenes surrounding the ego-vehicle(autonomous vehi...Realizing autonomy is a hot research topic for automatic vehicles in recent years. For a long time, most of the efforts to this goal concentrate on understanding the scenes surrounding the ego-vehicle(autonomous vehicle itself). By completing lowlevel vision tasks, such as detection, tracking and segmentation of the surrounding traffic participants, e.g., pedestrian, cyclists and vehicles, the scenes can be interpreted. However, for an autonomous vehicle, low-level vision tasks are largely insufficient to give help to comprehensive scene understanding. What are and how about the past, the on-going and the future of the scene participants? This deep question actually steers the vehicles towards truly full automation, just like human beings. Based on this thoughtfulness, this paper attempts to investigate the interpretation of traffic scene in autonomous driving from an event reasoning view. To reach this goal, we study the most relevant literatures and the state-of-the-arts on scene representation, event detection and intention prediction in autonomous driving. In addition, we also discuss the open challenges and problems in this field and endeavor to provide possible solutions.展开更多
基金This work was supported by the Key Research and Development Program of Hainan Province(Grant Nos.ZDYF2023GXJS163,ZDYF2024GXJS014)National Natural Science Foundation of China(NSFC)(Grant Nos.62162022,62162024)+2 种基金the Major Science and Technology Project of Hainan Province(Grant No.ZDKJ2020012)Hainan Provincial Natural Science Foundation of China(Grant No.620MS021)Youth Foundation Project of Hainan Natural Science Foundation(621QN211).
文摘Research on neural radiance fields for novel view synthesis has experienced explosive growth with the development of new models and extensions.The NeRF(Neural Radiance Fields)algorithm,suitable for underwater scenes or scattering media,is also evolving.Existing underwater 3D reconstruction systems still face challenges such as long training times and low rendering efficiency.This paper proposes an improved underwater 3D reconstruction system to achieve rapid and high-quality 3D reconstruction.First,we enhance underwater videos captured by a monocular camera to correct the image quality degradation caused by the physical properties of the water medium and ensure consistency in enhancement across frames.Then,we perform keyframe selection to optimize resource usage and reduce the impact of dynamic objects on the reconstruction results.After pose estimation using COLMAP,the selected keyframes undergo 3D reconstruction using neural radiance fields(NeRF)based on multi-resolution hash encoding for model construction and rendering.In terms of image enhancement,our method has been optimized in certain scenarios,demonstrating effectiveness in image enhancement and better continuity between consecutive frames of the same data.In terms of 3D reconstruction,our method achieved a peak signal-to-noise ratio(PSNR)of 18.40 dB and a structural similarity(SSIM)of 0.6677,indicating a good balance between operational efficiency and reconstruction quality.
文摘This paper presents a method for structured scene modeling using micro stereo vision system with large field of view. The proposed algorithm includes edge detection with Canny detector, line fitting with principle axis based approach, finding corresponding lines using feature based matching method, and 3D line depth computation.
基金the National Natural Science Foundation of China(No.62063006)to the Guangxi Natural Science Foundation under Grant(Nos.2023GXNSFAA026025,AA24010001)+3 种基金to the Innovation Fund of Chinese Universities Industry-University-Research(ID:2023RY018)to the Special Guangxi Industry and Information Technology Department,Textile and Pharmaceutical Division(ID:2021 No.231)to the Special Research Project of Hechi University(ID:2021GCC028)to the Key Laboratory of AI and Information Processing,Education Department of Guangxi Zhuang Autonomous Region(Hechi University),No.2024GXZDSY009。
文摘In dynamic scenarios,visual simultaneous localization and mapping(SLAM)algorithms often incorrectly incorporate dynamic points during camera pose computation,leading to reduced accuracy and robustness.This paper presents a dynamic SLAM algorithm that leverages object detection and regional dynamic probability.Firstly,a parallel thread employs the YOLOX object detectionmodel to gather 2D semantic information and compensate for missed detections.Next,an improved K-means++clustering algorithm clusters bounding box regions,adaptively determining the threshold for extracting dynamic object contours as dynamic points change.This process divides the image into low dynamic,suspicious dynamic,and high dynamic regions.In the tracking thread,the dynamic point removal module assigns dynamic probability weights to the feature points in these regions.Combined with geometric methods,it detects and removes the dynamic points.The final evaluation on the public TUM RGB-D dataset shows that the proposed dynamic SLAM algorithm surpasses most existing SLAM algorithms,providing better pose estimation accuracy and robustness in dynamic environments.
文摘With the upgrading of tourism consumption patterns,the traditional renovation models of waterfront recreational spaces centered on landscape design can no longer meet the commercial and humanistic demands of modern cultural and tourism development.Based on scene theory as the analytical framework and taking the Xuan en Night Banquet Project in Enshi as a case study,this paper explores the design pathway for transforming waterfront areas in tourism cities from"spatial reconstruction"to"scene construction".The study argues that waterfront space renewal should transcend mere physical renovation.By implementing three core strategies:spatial narrative framework,ecological industry creation,and cultural empowerment,it is possible to construct integrated scenarios that blend cultural value,consumption spaces,and lifestyle elements.This approach ultimately fosters sustained vitality in waterfront areas and promotes the high-quality development of cultural and tourism industry.
文摘Crime scene investigation(CSI)is an important link in the criminal justice system as it serves as a bridge between establishing the happenings during an incident and possibly identifying the accountable persons,providing light in the dark.The International Organization for Standardization(ISO)and the International Electrotechnical Commission(IEC)collaborated to develop the ISO/IEC 17020:2012 standard to govern the quality of CSI,a branch of inspection activity.These protocols include the impartiality and competence of the crime scene investigators involved,contemporary recording of scene observations and data obtained,the correct use of resources during scene processing,forensic evidence collection and handling procedures,and the confidentiality and integrity of any scene information obtained from other parties etc.The preparatory work,the accreditation processes involved and the implementation of new quality measures to the existing quality management system in order to achieve the ISO/IE 17020:2012 accreditation at the Forensic Science Division of the Government Laboratory in Hong Kong are discussed in this paper.
基金supported by the Glocal University 30 Project Fund of Gyeongsang National University in 2025.
文摘Scene graph prediction has emerged as a critical task in computer vision,focusing on transforming complex visual scenes into structured representations by identifying objects,their attributes,and the relationships among them.Extending this to 3D semantic scene graph(3DSSG)prediction introduces an additional layer of complexity because it requires the processing of point-cloud data to accurately capture the spatial and volumetric characteristics of a scene.A significant challenge in 3DSSG is the long-tailed distribution of object and relationship labels,causing certain classes to be severely underrepresented and suboptimal performance in these rare categories.To address this,we proposed a fusion prototypical network(FPN),which combines the strengths of conventional neural networks for 3DSSG with a Prototypical Network.The former are known for their ability to handle complex scene graph predictions while the latter excels in few-shot learning scenarios.By leveraging this fusion,our approach enhances the overall prediction accuracy and substantially improves the handling of underrepresented labels.Through extensive experiments using the 3DSSG dataset,we demonstrated that the FPN achieves state-of-the-art performance in 3D scene graph prediction as a single model and effectively mitigates the impact of the long-tailed distribution,providing a more balanced and comprehensive understanding of complex 3D environments.
文摘Remote sensing scene image classification is a prominent research area within remote sensing.Deep learningbased methods have been extensively utilized and have shown significant advancements in this field.Recent progress in these methods primarily focuses on enhancing feature representation capabilities to improve performance.The challenge lies in the limited spatial resolution of small-sized remote sensing images,as well as image blurring and sparse data.These factors contribute to lower accuracy in current deep learning models.Additionally,deeper networks with attention-based modules require a substantial number of network parameters,leading to high computational costs and memory usage.In this article,we introduce ERSNet,a lightweight novel attention-guided network for remote sensing scene image classification.ERSNet is constructed using a deep separable convolutional network and incorporates an attention mechanism.It utilizes spatial attention,channel attention,and channel self-attention to enhance feature representation and accuracy,while also reducing computational complexity and memory usage.Experimental results indicate that,compared to existing state-of-the-art methods,ERSNet has a significantly lower parameter count of only 1.2 M and reduced Flops.It achieves the highest classification accuracy of 99.14%on the EuroSAT dataset,demonstrating its suitability for application on mobile terminal devices.Furthermore,experimental results from the UCMerced land use dataset and the Brazilian coffee scene also confirm the strong generalization ability of this method.
基金supported by National 2011 Collaborative Innovation Center of Wireless Communication Technologies under Grant 2242022k60006。
文摘This paper presents a comprehensive framework that enables communication scene recognition through deep learning and multi-sensor fusion.This study aims to address the challenge of current communication scene recognition methods that struggle to adapt in dynamic environments,as they typically rely on post-response mechanisms that fail to detect scene changes before users experience latency.The proposed framework leverages data from multiple smartphone sensors,including acceleration sensors,gyroscopes,magnetic field sensors,and orientation sensors,to identify different communication scenes,such as walking,running,cycling,and various modes of transportation.Extensive experimental comparative analysis with existing methods on the open-source SHL-2018 dataset confirmed the superior performance of our approach in terms of F1 score and processing speed.Additionally,tests using a Microsoft Surface Pro tablet and a self-collected Beijing-2023 dataset have validated the framework's efficiency and generalization capability.The results show that our framework achieved an F1 score of 95.15%on SHL-2018and 94.6%on Beijing-2023,highlighting its robustness across different datasets and conditions.Furthermore,the levels of computational complexity and power consumption associated with the algorithm are moderate,making it suitable for deployment on mobile devices.
文摘Recognizing road scene context from a single image remains a critical challenge for intelligent autonomous driving systems,particularly in dynamic and unstructured environments.While recent advancements in deep learning have significantly enhanced road scene classification,simultaneously achieving high accuracy,computational efficiency,and adaptability across diverse conditions continues to be difficult.To address these challenges,this study proposes HybridLSTM,a novel and efficient framework that integrates deep learning-based,object-based,and handcrafted feature extraction methods within a unified architecture.HybridLSTM is designed to classify four distinct road scene categories—crosswalk(CW),highway(HW),overpass/tunnel(OP/T),and parking(P)—by leveraging multiple publicly available datasets,including Places-365,BDD100K,LabelMe,and KITTI,thereby promoting domain generalization.The framework fuses object-level features extracted using YOLOv5 and VGG19,scene-level global representations obtained from a modified VGG19,and fine-grained texture features captured through eight handcrafted descriptors.This hybrid feature fusion enables the model to capture both semantic context and low-level visual cues,which are critical for robust scene understanding.To model spatial arrangements and latent sequential dependencies present even in static imagery,the combined features are processed through a Long Short-Term Memory(LSTM)network,allowing the extraction of discriminative patterns across heterogeneous feature spaces.Extensive experiments conducted on 2725 annotated road scene images,with an 80:20 training-to-testing split,validate the effectiveness of the proposed model.HybridLSTM achieves a classification accuracy of 96.3%,a precision of 95.8%,a recall of 96.1%,and an F1-score of 96.0%,outperforming several existing state-of-the-art methods.These results demonstrate the robustness,scalability,and generalization capability of HybridLSTM across varying environments and scene complexities.Moreover,the framework is optimized to balance classification performance with computational efficiency,making it highly suitable for real-time deployment in embedded autonomous driving systems.Future work will focus on extending the model to multi-class detection within a single frame and optimizing it further for edge-device deployments to reduce computational overhead in practical applications.
基金co-supported by the Science and Technology Innovation Program of Hunan Province,China(No.2023RC3023)the National Natural Science Foundation of China(No.12272404)。
文摘The autonomous landing guidance of fixed-wing aircraft in unknown structured scenes presents a substantial technological challenge,particularly regarding the effectiveness of solutions for monocular visual relative pose estimation.This study proposes a novel airborne monocular visual estimation method based on structured scene features to address this challenge.First,a multitask neural network model is established for segmentation,depth estimation,and slope estimation on monocular images.And a monocular image comprehensive three-dimensional information metric is designed,encompassing length,span,flatness,and slope information.Subsequently,structured edge features are leveraged to filter candidate landing regions adaptively.By leveraging the three-dimensional information metric,the optimal landing region is accurately and efficiently identified.Finally,sparse two-dimensional key point is used to parameterize the optimal landing region for the first time and a high-precision relative pose estimation is achieved.Additional measurement information is introduced to provide the autonomous landing guidance information between the aircraft and the optimal landing region.Experimental results obtained from both synthetic and real data demonstrate the effectiveness of the proposed method in monocular pose estimation for autonomous aircraft landing guidance in unknown structured scenes.
基金funded by the Yangtze River Delta Science and Technology Innovation Community Joint Research Project(2023CSJGG1600)the Natural Science Foundation of Anhui Province(2208085MF173)Wuhu“ChiZhu Light”Major Science and Technology Project(2023ZD01,2023ZD03).
文摘In the dynamic scene of autonomous vehicles,the depth estimation of monocular cameras often faces the problem of inaccurate edge depth estimation.To solve this problem,we propose an unsupervised monocular depth estimation model based on edge enhancement,which is specifically aimed at the depth perception challenge in dynamic scenes.The model consists of two core networks:a deep prediction network and a motion estimation network,both of which adopt an encoder-decoder architecture.The depth prediction network is based on the U-Net structure of ResNet18,which is responsible for generating the depth map of the scene.The motion estimation network is based on the U-Net structure of Flow-Net,focusing on the motion estimation of dynamic targets.In the decoding stage of the motion estimation network,we innovatively introduce an edge-enhanced decoder,which integrates a convolutional block attention module(CBAM)in the decoding process to enhance the recognition ability of the edge features of moving objects.In addition,we also designed a strip convolution module to improve the model’s capture efficiency of discrete moving targets.To further improve the performance of the model,we propose a novel edge regularization method based on the Laplace operator,which effectively accelerates the convergence process of themodel.Experimental results on the KITTI and Cityscapes datasets show that compared with the current advanced dynamic unsupervised monocular model,the proposed model has a significant improvement in depth estimation accuracy and convergence speed.Specifically,the rootmean square error(RMSE)is reduced by 4.8%compared with the DepthMotion algorithm,while the training convergence speed is increased by 36%,which shows the superior performance of the model in the depth estimation task in dynamic scenes.
基金supported in part by the National Natural Science Foundation of China[Grant number 62471075]the Major Science and Technology Project Grant of the Chongqing Municipal Education Commission[Grant number KJZD-M202301901]Graduate Innovation Fund of Chongqing[gzlcx20253235].
文摘Semantic segmentation in street scenes is a crucial technology for autonomous driving to analyze the surrounding environment.In street scenes,issues such as high image resolution caused by a large viewpoints and differences in object scales lead to a decline in real-time performance and difficulties in multi-scale feature extraction.To address this,we propose a bilateral-branch real-time semantic segmentationmethod based on semantic information distillation(BSDNet)for street scene images.The BSDNet consists of a Feature Conversion Convolutional Block(FCB),a Semantic Information Distillation Module(SIDM),and a Deep Aggregation Atrous Convolution Pyramid Pooling(DASP).FCB reduces the semantic gap between the backbone and the semantic branch.SIDM extracts high-quality semantic information fromthe Transformer branch to reduce computational costs.DASP aggregates information lost in atrous convolutions,effectively capturingmulti-scale objects.Extensive experiments conducted on Cityscapes,CamVid,and ADE20K,achieving an accuracy of 81.7% Mean Intersection over Union(mIoU)at 70.6 Frames Per Second(FPS)on Cityscapes,demonstrate that our method achieves a better balance between accuracy and inference speed.
基金Supported by the Major Science and Technology Project of Hubei Province of China(2022AAA009)the Open Fund of Hubei Luojia Laboratory。
文摘Today,autonomous mobile robots are widely used in all walks of life.Autonomous navigation,as a basic capability of robots,has become a research hotspot.Classical navigation techniques,which rely on pre-built maps,struggle to cope with complex and dynamic environments.With the development of artificial intelligence,learning-based navigation technology have emerged.Instead of relying on pre-built maps,the agent perceives the environment and make decisions through visual observation,enabling end-to-end navigation.A key challenge is to enhance the generalization ability of the agent in unfamiliar environments.To tackle this challenge,it is necessary to endow the agent with spatial intelligence.Spatial intelligence refers to the ability of the agent to transform visual observations into insights,in-sights into understanding,and understanding into actions.To endow the agent with spatial intelligence,relevant research uses scene graph to represent the environment.We refer to this method as scene graph-based object goal navigation.In this paper,we concentrate on scene graph,offering formal description,computational framework of object goal navigation.We provide a comprehensive summary of the meth-ods for constructing and applying scene graph.Additionally,we present experimental evidence that highlights the critical role of scene graph in improving navigation success.This paper also delineates promising research directions,all aimed at sharpening the focus on scene graph.Overall,this paper shows how scene graph endows the agent with spatial intelligence,aiming to promote the importance of scene graph in the field of intelligent navigation.
基金supported in part by the National Natural Science Foundation of China under Grants 62071345。
文摘Self-supervised monocular depth estimation has emerged as a major research focus in recent years,primarily due to the elimination of ground-truth depth dependence.However,the prevailing architectures in this domain suffer from inherent limitations:existing pose network branches infer camera ego-motion exclusively under static-scene and Lambertian-surface assumptions.These assumptions are often violated in real-world scenarios due to dynamic objects,non-Lambertian reflectance,and unstructured background elements,leading to pervasive artifacts such as depth discontinuities(“holes”),structural collapse,and ambiguous reconstruction.To address these challenges,we propose a novel framework that integrates scene dynamic pose estimation into the conventional self-supervised depth network,enhancing its ability to model complex scene dynamics.Our contributions are threefold:(1)a pixel-wise dynamic pose estimation module that jointly resolves the pose transformations of moving objects and localized scene perturbations;(2)a physically-informed loss function that couples dynamic pose and depth predictions,designed to mitigate depth errors arising from high-speed distant objects and geometrically inconsistent motion profiles;(3)an efficient SE(3)transformation parameterization that streamlines network complexity and temporal pre-processing.Extensive experiments on the KITTI and NYU-V2 benchmarks show that our framework achieves state-of-the-art performance in both quantitative metrics and qualitative visual fidelity,significantly improving the robustness and generalization of monocular depth estimation under dynamic conditions.
基金supported by the Zhejiang Provincial Natural Science Foundation of China(No.LQ23F030001)the National Natural Science Foundation of China(No.62406280)+5 种基金the Autism Research Special Fund of Zhejiang Foundation for Disabled Persons(No.2023008)the Liaoning Province Higher Education Innovative Talents Program Support Project(No.LR2019058)the Liaoning Province Joint Open Fund for Key Scientific and Technological Innovation Bases(No.2021-KF-12-05)the Central Guidance on Local Science and Technology Development Fund of Liaoning Province(No.2023JH6/100100066)the Key Laboratory for Biomedical Engineering of Ministry of Education,Zhejiang University,Chinain part by the Open Research Fund of the State Key Laboratory of Cognitive Neuroscience and Learning.
文摘Video action recognition(VAR)aims to analyze dynamic behaviors in videos and achieve semantic understanding.VAR faces challenges such as temporal dynamics,action-scene coupling,and the complexity of human interactions.Existing methods can be categorized into motion-level,event-level,and story-level ones based on spatiotemporal granularity.However,single-modal approaches struggle to capture complex behavioral semantics and human factors.Therefore,in recent years,vision-language models(VLMs)have been introduced into this field,providing new research perspectives for VAR.In this paper,we systematically review spatiotemporal hierarchical methods in VAR and explore how the introduction of large models has advanced the field.Additionally,we propose the concept of“Factor”to identify and integrate key information from both visual and textual modalities,enhancing multimodal alignment.We also summarize various multimodal alignment methods and provide in-depth analysis and insights into future research directions.
基金The National Social Science Foundation of China (No.CBA080236)the Graduate Innovation Project of Jiangsu Province (No.CX08B-016R)
文摘As eye tracking can be used to record moment-to-moment changes of eye movements as people inspect pictures of natural scenes and comprehend information, this paper attempts to use eye-movement technology to investigate how the order of presentation and the characteristics of information affect the semantic mismatch effect in the picture-sentence paradigm. A 3(syntax)×2(semantic relation) factorial design is adopted, with syntax and semantic relations as within-participant variables. The experiment finds that the semantic mismatch is most likely to increase cognitive loads as people have to spend more time, including first-pass time, regression path duration, and total fixation duration. Double negation does not significantly increase the processing difficulty of pictures and information. Experimental results show that people can extract the special syntactic strategy from long-term memory to process pictures and sentences with different semantic relations. It enables readers to comprehend double negation as affirmation. These results demonstrate that the constituent comparison model may not be a general model regarding other languages.
文摘[Objective] The aim was to establish a model based on spatial scene similarity, for which soil, slope, transport, water conservancy, light, social economic factors in suitable planting areas were all considered. A new suitable planting area of flue-cured tobacco was determined by comparison and analysis, with consideration of excellent area. [Method] Totaling thirty natural factors were chosen, which were clas- sified into nine categories, from Longpeng Town (LP) and Shaochong Town (SC) in Shiping County in Honghe Hani and Yi Autonomous Prefecture. [Result] According to weights, the factors from high to low were as follows: soil〉light〉elevation〉slope〉 water conservancy〉transport〉baking facility〉planting plans over the years〉others. The similarity of geographical conditions in the area was 0.894 3, which indicated that the planting conditions in the two regions are similar. If farmer population in unit area, farmland quantity for individual farmer, labors in every household, activity in planting flue-cured tobacco and work of local instructor were considered, the weights of different factors were as follows: farmer population in unit area〉farmland quantity for individual farmer〉farmers' activity in planting flue-cured tobacco〉educational back- ground〉labor force in every household〉instructor〉population of farmers' children at- tending school. The similarity of geographical conditions was 0.703 1, which indicated that it is none-natural factors that influence yield and quality of flue-cured tobacco. [Conclusion] According to analysis on suitable planting area of flue-cured tobacco based on assessment of spatial scene similarity, similarity of growing conditions in two spatial scenes can be analyzed and evaluated, which would promote further exploration on, influencing factors and effects on tobacco production.
基金supported by Beijing Natural Science Foundation,China(No.4182020)Open Fund of State Laboratory of Information Engineering in Surveying,Mapping and Remote Sensing,Wuhan University,China(No.17E01)Key Laboratory for Health Monitoring and Control of Large Structures,Shijiazhuang,China(No.KLLSHMC1901)。
文摘A method of multi-block Single Shot Multi Box Detector(SSD)based on small object detection is proposed to the railway scene of unmanned aerial vehicle surveillance.To address the limitation of small object detection,a multi-block SSD mechanism,which consists of three steps,is designed.First,the original input images are segmented into several overlapped patches.Second,each patch is separately fed into an SSD to detect the objects.Third,the patches are merged together through two stages.In the first stage,the truncated object of the sub-layer detection result is spliced.In the second stage,a sub-layer suppression and filtering algorithm applying the concept of non-maximum suppression is utilized to remove the overlapped boxes of sub-layers.The boxes that are not detected in the main-layer are retained.In addition,no sufficient labeled training samples of railway circumstance are available,thereby hindering the deployment of SSD.A two-stage training strategy leveraging to transfer learning is adopted to solve this issue.The deep learning model is preliminarily trained using labeled data of numerous auxiliaries,and then it is refined using only a few samples of railway scene.A railway spot in China,which is easily damaged by landslides,is investigated as a case study.Experimental results show that the proposed multi-block SSD method produces an overall accuracy of 96.6%and obtains an improvement of up to 9.2%compared with the traditional SSD.
基金supported by National Key R&D Program Project of China(No.2016YFB1001004)National Natural Science Foundation of China(Nos.61751308,61603057,61773311)+1 种基金China Postdoctoral Science Foundation(No.2017M613152)Collaborative Research with MSRA
文摘Realizing autonomy is a hot research topic for automatic vehicles in recent years. For a long time, most of the efforts to this goal concentrate on understanding the scenes surrounding the ego-vehicle(autonomous vehicle itself). By completing lowlevel vision tasks, such as detection, tracking and segmentation of the surrounding traffic participants, e.g., pedestrian, cyclists and vehicles, the scenes can be interpreted. However, for an autonomous vehicle, low-level vision tasks are largely insufficient to give help to comprehensive scene understanding. What are and how about the past, the on-going and the future of the scene participants? This deep question actually steers the vehicles towards truly full automation, just like human beings. Based on this thoughtfulness, this paper attempts to investigate the interpretation of traffic scene in autonomous driving from an event reasoning view. To reach this goal, we study the most relevant literatures and the state-of-the-arts on scene representation, event detection and intention prediction in autonomous driving. In addition, we also discuss the open challenges and problems in this field and endeavor to provide possible solutions.