Funding: supported by the Zhejiang Provincial Natural Science Foundation of China (No. LQ23F030001), the National Natural Science Foundation of China (No. 62406280), the Autism Research Special Fund of Zhejiang Foundation for Disabled Persons (No. 2023008), the Liaoning Province Higher Education Innovative Talents Program Support Project (No. LR2019058), the Liaoning Province Joint Open Fund for Key Scientific and Technological Innovation Bases (No. 2021-KF-12-05), the Central Guidance on Local Science and Technology Development Fund of Liaoning Province (No. 2023JH6/100100066), the Key Laboratory for Biomedical Engineering of Ministry of Education, Zhejiang University, China, and in part by the Open Research Fund of the State Key Laboratory of Cognitive Neuroscience and Learning.
Abstract: Video action recognition (VAR) aims to analyze dynamic behaviors in videos and achieve semantic understanding. VAR faces challenges such as temporal dynamics, action-scene coupling, and the complexity of human interactions. Existing methods can be categorized into motion-level, event-level, and story-level ones based on spatiotemporal granularity. However, single-modal approaches struggle to capture complex behavioral semantics and human factors. Therefore, in recent years, vision-language models (VLMs) have been introduced into this field, providing new research perspectives for VAR. In this paper, we systematically review spatiotemporal hierarchical methods in VAR and explore how the introduction of large models has advanced the field. Additionally, we propose the concept of "Factor" to identify and integrate key information from both visual and textual modalities, enhancing multimodal alignment. We also summarize various multimodal alignment methods and provide in-depth analysis and insights into future research directions.
Abstract: The 3-D task space in modeling and animation is usually reduced to the separate control dimensions supported by conventional interactive devices. This limitation maps only a partial view of the problem to the device space at a time, and results in a tedious and unnatural control interface. This paper uses the DataGlove interface for modeling and animating scene behaviors. The modeling interface selects, scales, rotates, translates, copies, and deletes instances of the primitives. These basic modeling processes are performed directly in the task space, using hand shapes and motions. Hand shapes are recognized as discrete states that trigger the commands, and hand motions are mapped to the movement of a selected instance. Interaction through the hand interface places the user as a participant in the process of behavior simulation. Both event triggering and role switching of the hand are tested in simulation. In event mode, the hand triggers control signals or commands through a menu interface. In object mode, the hand acts as an object whose appearance or motion influences the motions of other objects in the scene. The involvement of the hand creates a diversity of dynamic situations for testing variable scene behaviors. Our experiments have shown the potential of using this interface directly in the 3-D modeling and animation task space.