Funding: supported by the Beijing Key Laboratory of Digital Design & Manufacture, the Academic Excellence Foundation of Beihang University for Ph.D. Students, and the MIIT (Ministry of Industry and Information Technology) Key Laboratory of Smart Manufacturing for High-end Aerospace Products.
Abstract: Sensor-fusion based navigation has attracted significant attention for its robustness and accuracy in a variety of applications. To achieve versatile and efficient state estimation both indoors and outdoors, this paper presents an improved monocular visual-inertial navigation architecture built on the Multi-State Constraint Kalman Filter (MSCKF). In addition, to relax the initialization requirement of accumulating enough stable poses in MSCKF, a rapid and robust Initialization MSCKF (I-MSCKF) navigation method is proposed. Based on the trifocal tensor and the sigma-point filter, initialization of the integrated navigation can be completed within three consecutive visual frames, so the proposed I-MSCKF method improves navigation performance when the system suffers shocks at the initial stage. Moreover, the sigma-point filter is applied at the initial stage to improve the accuracy of state estimation. The state vector generated at the initial stage is consistent with MSCKF, enabling a seamless transition between initialization and subsequent navigation in I-MSCKF. Finally, experimental results show that the proposed I-MSCKF method improves the robustness and accuracy of monocular visual-inertial navigation.
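The abstract does not give the filter equations, but the sigma-point (unscented) transform it relies on is standard. The sketch below shows how a mean and covariance can be propagated through a nonlinear function with sigma points; the function f, the state dimension, and the UT parameters are generic assumptions for illustration, not details from the paper.

```python
# Minimal sigma-point (unscented) transform sketch, assuming a generic
# nonlinear propagation function f; names and parameters are illustrative.
import numpy as np

def unscented_transform(x, P, f, alpha=1e-3, beta=2.0, kappa=0.0):
    """Propagate mean x (n,) and covariance P (n, n) through f."""
    n = x.size
    lam = alpha**2 * (n + kappa) - n
    S = np.linalg.cholesky((n + lam) * P)      # matrix square root of scaled P

    # 2n + 1 sigma points: the mean plus symmetric deviations along S's columns.
    sigma = np.vstack([x, x + S.T, x - S.T])   # (2n+1, n)

    # Standard UT weights for mean and covariance.
    wm = np.full(2 * n + 1, 1.0 / (2 * (n + lam)))
    wc = wm.copy()
    wm[0] = lam / (n + lam)
    wc[0] = wm[0] + (1 - alpha**2 + beta)

    y = np.array([f(s) for s in sigma])        # propagate each sigma point
    y_mean = wm @ y
    d = y - y_mean
    y_cov = (wc[:, None] * d).T @ d
    return y_mean, y_cov
```

In an I-MSCKF-style initialization, the role of f would presumably be played by the IMU propagation or the trifocal-tensor measurement model; here it is deliberately left abstract.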
Funding: supported by the National Key R&D Program of China (No. 2020AAA0108904) and the Science and Technology Plan of Shenzhen (No. JCYJ20200109140410340).
Abstract: Audio-visual wake word spotting is a challenging multi-modal task that exploits visual information from lip motion patterns to supplement acoustic speech and improve overall detection performance. However, most audio-visual wake word spotting models are only suitable for simple single-speaker scenarios and incur high computational complexity, which hinders their use in complex multi-person scenarios and computationally limited mobile environments. In this paper, a novel audio-visual model is proposed for on-device multi-person wake word spotting. First, an attention-based audio-visual voice activity detection module is presented, which generates an attention score matrix between audio and visual representations to derive the active speaker representation. Second, knowledge distillation is introduced to transfer knowledge from a large model to the on-device model, keeping the model size under control. Moreover, a new audio-visual dataset, PKU-KWS, is collected for sentence-level multi-person wake word spotting. Experimental results on the PKU-KWS dataset show that this approach outperforms previous state-of-the-art methods.
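As a rough illustration of the attention-based voice activity idea (an attention score matrix between audio and per-speaker visual representations, pooled into an active-speaker representation), consider the sketch below. The tensor shapes, the scaled dot-product scoring, and the softmax pooling over speakers are assumptions made for illustration, not the paper's exact architecture.

```python
# Hedged sketch: score each candidate face track against the audio stream and
# pool the visual features of the most active speaker per frame.
import torch
import torch.nn.functional as F

def active_speaker_representation(audio_feat, visual_feats):
    """
    audio_feat:   (T, D)    audio representation over T frames
    visual_feats: (S, T, D) lip-motion representations for S candidate speakers
    returns:      (T, D)    attention-weighted active-speaker representation
    """
    # Attention score matrix: similarity between audio and each speaker track.
    scores = torch.einsum('td,std->st', audio_feat, visual_feats)  # (S, T)
    scores = scores / audio_feat.shape[-1] ** 0.5                  # scale by sqrt(D)

    # Softmax over speakers gives per-frame speaker attention weights.
    weights = F.softmax(scores, dim=0)                             # (S, T)

    # Weighted sum over speakers yields the active-speaker visual stream.
    return torch.einsum('st,std->td', weights, visual_feats)
```

The output stream can then be fused with the audio features by a downstream wake word classifier; in the paper this large teacher pipeline is further distilled into a smaller on-device student model.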
Funding: supported by the Theme-based Research Scheme, Research Grants Council of Hong Kong (No. T45-205/21-N).
Abstract: Novel view synthesis has recently attracted tremendous research attention for its applications in virtual reality and immersive telepresence. Rendering a locally immersive light field (LF) from RGB references with arbitrarily large baselines is a challenging problem that existing novel view synthesis techniques do not solve efficiently. In this work, we aim at faithfully rendering local immersive novel views/LF images based on large-baseline LF captures and a single RGB image in the target view. To fully exploit the information in the source LF captures, we propose a novel occlusion-aware source sampler (OSS) module that efficiently transfers the pixels of source views into the target view's frustum in an occlusion-aware manner. An attention-based deep visual fusion module is proposed to fuse the revealed occluded background content with a preliminary LF into a final refined LF. The proposed source sampling and fusion mechanism not only provides information for occluded regions from varying observation angles but also effectively enhances visual rendering quality. Experimental results show that our method renders high-quality LF images/novel views from sparse RGB references and outperforms state-of-the-art LF rendering and novel view synthesis methods.
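The abstract describes the OSS module only at a high level. As a hand-written stand-in for what transferring source pixels into the target frustum "in an occlusion-aware manner" can mean geometrically, the sketch below splats source-view pixels into the target view and resolves occlusions with a z-buffer test. The pinhole model, known source depth, and camera conventions are all assumptions; the paper's OSS module is learned, not this fixed projection.

```python
# Illustrative occlusion-aware warp: back-project source pixels, reproject
# them into the target view, and keep only the sample nearest the camera.
import numpy as np

def warp_source_to_target(src_rgb, src_depth, K, T_src_to_tgt, out_hw):
    """src_rgb (H, W, 3), src_depth (H, W), K (3, 3) shared intrinsics,
    T_src_to_tgt (4, 4) relative pose; returns target-view RGB and depth."""
    H, W = src_depth.shape
    v, u = np.mgrid[0:H, 0:W]
    pix = np.stack([u, v, np.ones_like(u)], -1).reshape(-1, 3).T   # (3, N)

    # Back-project source pixels to 3-D and move them into the target frame.
    pts = np.linalg.inv(K) @ pix * src_depth.reshape(1, -1)
    pts = T_src_to_tgt @ np.vstack([pts, np.ones((1, pts.shape[1]))])

    # Project the 3-D points into the target image plane.
    proj = K @ pts[:3]
    z = proj[2]
    Ht, Wt = out_hw
    ok = z > 1e-6                          # points in front of the camera
    x = np.zeros(z.shape, int)
    y = np.zeros(z.shape, int)
    x[ok] = np.round(proj[0, ok] / z[ok]).astype(int)
    y[ok] = np.round(proj[1, ok] / z[ok]).astype(int)
    ok &= (x >= 0) & (x < Wt) & (y >= 0) & (y < Ht)

    tgt_rgb = np.zeros((Ht, Wt, 3), src_rgb.dtype)
    tgt_depth = np.full((Ht, Wt), np.inf)
    colors = src_rgb.reshape(-1, 3)
    for i in np.flatnonzero(ok):           # z-buffer: keep the nearest sample,
        if z[i] < tgt_depth[y[i], x[i]]:   # so occluded pixels are discarded
            tgt_depth[y[i], x[i]] = z[i]
            tgt_rgb[y[i], x[i]] = colors[i]
    return tgt_rgb, tgt_depth
```

Regions left empty by this warp correspond to disocclusions; in the paper these are the areas the attention-based fusion module fills in from the preliminary LF and other source views.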