Funding: the National Key Research and Development Program of China (2016YFB1001501), NSF of China (61672457), the Fundamental Research Funds for the Central Universities (2018FZA5011), and the Zhejiang University-SenseTime Joint Lab of 3D Vision.
Abstract: Although VSLAM/VISLAM has achieved great success, it is still difficult to quantitatively evaluate the localization results of different kinds of SLAM systems from the perspective of augmented reality (AR), due to the lack of an appropriate benchmark. In practical AR applications, a variety of challenging situations (e.g., fast motion, strong rotation, serious motion blur, dynamic interference) are easily encountered, since a home user may not move the AR device carefully and the real environment may be quite complex. In addition, for a good AR experience, the frequency of camera tracking loss should be minimized, and recovery from the failure state should be fast and accurate. Existing SLAM datasets/benchmarks generally evaluate only pose accuracy, and their camera motions are relatively simple and do not fit the common cases in mobile AR applications well. Motivated by this, we build a new visual-inertial dataset as well as a series of evaluation criteria for AR. We also review existing monocular VSLAM/VISLAM approaches with detailed analyses and comparisons. In particular, we select eight representative monocular VSLAM/VISLAM approaches/systems and quantitatively evaluate them on our benchmark. Our dataset, sample code, and corresponding evaluation tools are available at the benchmark website http://www.zjucvg.net/eval-vislam/.
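The specific evaluation criteria of this benchmark are not detailed in the abstract. As background, a standard pose-accuracy metric for SLAM trajectories is the absolute trajectory error (ATE): the RMSE of positional differences after a least-squares rigid alignment of the estimated trajectory to ground truth. The sketch below is a generic illustration of that common metric, not this benchmark's evaluation tool.

```python
import numpy as np

def align_rigid(est, gt):
    """Least-squares rigid alignment (rotation + translation, no scale)
    of an estimated trajectory onto ground truth.  est, gt: (N, 3) arrays."""
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    E, G = est - mu_e, gt - mu_g
    U, _, Vt = np.linalg.svd(G.T @ E)          # SVD of the cross-covariance
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    R = U @ S @ Vt                             # optimal rotation (Kabsch)
    t = mu_g - R @ mu_e                        # optimal translation
    return (R @ est.T).T + t

def ate_rmse(est, gt):
    """Absolute trajectory error (RMSE) after rigid alignment."""
    aligned = align_rigid(est, gt)
    return float(np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1))))
```

A trajectory that differs from ground truth only by a rigid motion yields an ATE near zero, so the metric measures drift and local error rather than the arbitrary choice of world frame.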
Funding: sponsored by the National Natural Science Foundation of China (31771017, 31972924, 81873997), the Science and Technology Commission of Shanghai Municipality (16441908700), the Innovation Research Plan supported by the Shanghai Municipal Education Commission (ZXWF082101), the National Key R&D Program of China (2017YFC0110700, 2018YFF0300504, 2019YFC0120600), the Natural Science Foundation of Shanghai (18ZR1428600), and the Interdisciplinary Program of Shanghai Jiao Tong University (ZH2018QNA06, YG2017MS09).
Abstract: Deep-learning methods provide a promising approach for measuring in-vivo knee joint motion via fast registration of two-dimensional (2D) to three-dimensional (3D) data over a broad capture range. However, a data-driven approach fails when there are insufficient data for training. We propose a feature-based transfer-learning method to extract features from fluoroscopic images. With three subjects and fewer than 100 pairs of real fluoroscopic images, we achieved a mean registration success rate of up to 40%. The proposed method thus offers a promising learning-based registration solution when only a limited number of real fluoroscopic images is available.
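The abstract reports a registration success rate without stating the success criterion. A common convention in 2D/3D registration is to count a trial as successful when both the rotation error (geodesic angle between estimated and ground-truth rotations) and the translation error fall below thresholds; the thresholds below are illustrative placeholders, not values from this paper.

```python
import numpy as np

def rotation_error_deg(R_est, R_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    cos = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def success_rate(poses_est, poses_gt, rot_thresh_deg=1.0, trans_thresh=1.0):
    """Fraction of registrations within rotation/translation thresholds.
    Each pose is an (R, t) pair; thresholds are hypothetical examples."""
    ok = 0
    for (Re, te), (Rg, tg) in zip(poses_est, poses_gt):
        r_err = rotation_error_deg(Re, Rg)
        t_err = np.linalg.norm(np.asarray(te) - np.asarray(tg))
        ok += (r_err <= rot_thresh_deg) and (t_err <= trans_thresh)
    return ok / len(poses_est)
```

With such a criterion, "a mean registration success rate of up to 40%" simply means that up to 40% of the test registrations landed within the chosen error bounds.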
Funding: the National Natural Science Foundation of China under Grant Nos. 61672273 and 61832008, the Science Foundation for Distinguished Young Scholars of Jiangsu under Grant No. BK20160021, the Postdoctoral Innovative Talent Support Program of China under Grant Nos. BX20200168 and 2020M681608, and the General Research Fund of Hong Kong under Grant No. 27208720.
Abstract: Transformers have recently led to encouraging progress in computer vision. In this work, we present new baselines by improving the original Pyramid Vision Transformer (PVT v1) with three designs: (i) a linear-complexity attention layer, (ii) an overlapping patch embedding, and (iii) a convolutional feed-forward network. With these modifications, PVT v2 reduces the computational complexity of PVT v1 to linear and delivers significant improvements on fundamental vision tasks such as classification, detection, and segmentation. In particular, PVT v2 achieves performance comparable to or better than recent work such as the Swin Transformer. We hope this work will facilitate state-of-the-art transformer research in computer vision. Code is available at https://github.com/whai362/PVT.
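The core idea behind a linear-complexity attention layer of this kind is to average-pool the keys and values down to a fixed spatial size before attention, so the score matrix is N x pool^2 instead of N x N. The NumPy sketch below illustrates only that idea for a single head, omitting the learned projections and the depth-wise convolution of the actual PVT v2 implementation; it is not the paper's code.

```python
import numpy as np

def pooled_attention(x, h, w, pool=7):
    """Single-head attention where keys/values are average-pooled to a
    fixed pool x pool grid, making the cost linear in the token count N.
    x: (N, C) tokens laid out on an h x w grid (h, w divisible by pool)."""
    C = x.shape[1]
    grid = x.reshape(h, w, C)
    # block-average the token grid down to (pool, pool)
    kv = grid.reshape(pool, h // pool, pool, w // pool, C).mean(axis=(1, 3))
    kv = kv.reshape(pool * pool, C)             # (pool^2, C) keys/values
    scores = x @ kv.T / np.sqrt(C)              # (N, pool^2): linear in N
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax
    return weights @ kv                          # (N, C) attended output
```

Because the pooled key/value set has constant size, doubling the input resolution doubles the attention cost rather than quadrupling it, which is what makes the design practical for dense tasks like detection and segmentation.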
Funding: this study was supported by the National Key Research and Development Program of China (Grant 2020YFF0304300).
Abstract: The real-time application of artificial intelligence (AI) technologies in sports is a long-standing challenge, owing to the large spatial extent of sports fields and the complexity and uncertainty of real-world environments.
Funding: the Provincial Natural Science Foundation of China (LH2020F044), the 2019 "Chunhui" Plan Cooperative Scientific Research Project of the Ministry of Education of China (HLJ2019015), the Fundamental Research Funds for Heilongjiang University, China (2020-KYYWF-1014), NSFC (62102099), and the National Key R&D Program of China (2018YFE0205503).
Abstract: Sixth-generation (6G)-enabled edge intelligence opens up a new era of the Internet of Everything and makes it possible to interconnect people, devices, and the cloud anytime, anywhere. More and more next-generation wireless-network smart service applications are changing our way of life and improving our quality of life. As the hottest new form of next-generation Internet application, the Metaverse is striving to connect billions of users and create a shared world where the virtual and the real merge. However, limited by resources, computing power, and sensory devices, the Metaverse is still far from realizing its full vision of immersion, materialization, and interoperability. To this end, this survey aims to advance that vision through the organic integration of 6G-enabled edge artificial intelligence (AI) and the Metaverse. Specifically, we first introduce three new types of edge-Metaverse architectures that use 6G-enabled edge AI to overcome resource and computing constraints in the Metaverse. We then summarize the technical challenges these architectures face in the Metaverse, together with existing solutions. Furthermore, we explore how edge-Metaverse architecture technology helps the Metaverse interact with and share digital data. Finally, we discuss future research directions toward realizing the true vision of the Metaverse with 6G-enabled edge AI.
Funding: supported by the National Natural Science Foundation of China (Nos. 62176256 and U23B2054), the Beijing Science and Technology Plan Project (No. Z231100005923033), the Beijing Natural Science Foundation (No. L221013), the Youth Innovation Promotion Association of the Chinese Academy of Sciences (No. Y2021131), and the InnoHK program.
Abstract: Facial mesh tracking enables the production of topologically consistent 3D facial meshes from stereo video input captured by calibrated cameras. This technology is an integral part of many digital human applications, such as personalized avatar creation, audio-driven 3D facial animation, and talking-face video generation. Currently, most facial mesh tracking methods are built on computer graphics techniques, which involve complex procedures and often necessitate human annotation within their pipelines. As a result, these approaches are difficult to implement and hard to generalize across scenarios. We propose a backpropagation-based solution, called BPMT, that formulates facial mesh tracking as a differentiable optimization problem. Our solution leverages visual cues extracted from the stereo input to estimate vertex-wise geometry and texture information. BPMT consists of two steps: automatic face analysis and mesh tracking. In the first step, a range of visual cues is automatically extracted from the input, including facial point clouds, multi-view 2D landmarks, 3D landmarks in the world coordinate system, motion fields, and image masks. The second step is posed as a differentiable optimization problem whose constraints comprise the stereo video input and the facial cues. The primary objective is to achieve topologically consistent 3D facial meshes across frames. The parameters to be optimized comprise the positions of free-form deformed vertices and a shared texture UV map. Furthermore, a 3D morphable model (3DMM) is introduced as a form of regularization to improve the convergence of the optimization. Leveraging mature backpropagation software, we progressively register the facial meshes to the recorded subject, generating high-quality 3D faces with consistent topologies. BPMT requires no manual labeling within the pipeline, making it suitable for producing large-scale stereo facial data. Moreover, our method exhibits a high degree of flexibility and extensibility, positioning it as a promising platform for future research in the community.
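The general pattern behind such backpropagation-based mesh tracking is gradient descent on an objective combining data terms (landmark constraints) with a prior that regularizes the solution. The toy sketch below fits free vertex positions to target 3D landmarks under a quadratic pull toward a template mesh, as a simplified stand-in for the 3DMM regularizer; it is a conceptual illustration, not BPMT's actual objective or implementation.

```python
import numpy as np

def fit_vertices(template, targets, idx, lam=0.1, lr=0.2, iters=200):
    """Toy differentiable fit: move mesh vertices so the vertices at
    `idx` match 3D landmark `targets`, while a quadratic prior keeps
    all vertices near `template` (stand-in for a 3DMM-style prior).
    Objective: ||V[idx] - targets||^2 + lam * ||V - template||^2."""
    V = template.copy()
    for _ in range(iters):
        grad = 2.0 * lam * (V - template)       # gradient of the prior term
        grad[idx] += 2.0 * (V[idx] - targets)   # gradient of the data term
        V -= lr * grad                          # gradient-descent step
    return V
```

In this convex toy, each constrained vertex converges to the weighted average (target + lam * template) / (1 + lam), showing how the regularization weight trades landmark fidelity against the prior; automatic differentiation frameworks generalize the same loop to non-trivial terms such as photometric and motion-field losses.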
Funding: supported by the National Key R&D Program of China (Nos. 2022ZD0160102 and 2022ZD0161300) and the National Natural Science Foundation of China (Nos. 62376134 and 62372223).
Abstract: Multi-modal large language models (MLLMs) have demonstrated impressive performance in vision-language tasks across a wide range of domains. However, their large scale and the associated high computational cost pose significant challenges for training and deploying MLLMs on consumer-grade GPUs or edge devices, hindering their widespread application. In this work, we introduce Mini-InternVL, a series of MLLMs with 1 billion to 4 billion parameters, which achieves 90% of the performance with only 5% of the parameters. This significant gain in efficiency and effectiveness makes our models more accessible and applicable in various real-world scenarios. To further promote their adoption, we develop a unified adaptation framework for Mini-InternVL, which enables our models to transfer to, and outperform specialized models in, downstream tasks including autonomous driving, medical image processing, and remote sensing. We believe our models can provide valuable insights and resources to advance the development of efficient and effective MLLMs.