Funding: Supported by the National Natural Science Foundation of China (No. 62176058) and the National Key R&D Program of China (2023YFF1204800).
Abstract: Event extraction extracts event frames from text, while grounded situation recognition detects events in images. Because real-world applications frequently encounter unforeseen events, researchers have studied both in-domain and cross-domain event extraction, whereas grounded situation recognition has so far been explored mainly in the in-domain setting. In this paper, we therefore propose cross-domain grounded situation recognition and establish a new benchmark, SWiG-XD. In this more challenging setting, we exploit the underlying unity of the two tasks across the two modalities and study how to transfer generalization ability from text to images. First, we use ChatGPT to automatically generate textual data of two kinds: one kind is directly matched with images, establishing a direct link to the visual data; the other covers all event types and therefore generalizes more broadly. We then employ a unified model framework that associates textual concepts with local image features and transfers cross-domain generalization across modalities through modality-shared prompts and a self-attention mechanism. Furthermore, we incorporate the more general textual data to further improve generalization on images. Experimental results on the newly constructed benchmark demonstrate the effectiveness of our method.
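The abstract does not give implementation details, so the following PyTorch sketch is only a rough illustration of how modality-shared prompts combined with joint self-attention might couple textual concepts with local image features; all class names, dimensions, and the choice of a standard transformer encoder are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class SharedPromptFusion(nn.Module):
    """Hypothetical sketch: learnable prompts shared by both modalities are
    prepended to text-concept embeddings and local image features, and a
    single self-attention encoder processes the joint sequence so the prompts
    can carry information across modalities."""

    def __init__(self, dim=512, num_prompts=8, num_heads=8, num_layers=2):
        super().__init__()
        # Modality-shared prompts: the same parameters serve text and image branches.
        self.shared_prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, text_tokens, image_tokens):
        # text_tokens:  (B, Lt, dim) embeddings of textual concepts (e.g. event types, roles)
        # image_tokens: (B, Li, dim) local image features (e.g. region or patch features)
        batch = text_tokens.size(0)
        prompts = self.shared_prompts.unsqueeze(0).expand(batch, -1, -1)
        # Joint self-attention over [prompts; text; image]: every token can attend
        # to every other, which is one way to realize cross-modal association.
        fused = self.encoder(torch.cat([prompts, text_tokens, image_tokens], dim=1))
        n_p, n_t = prompts.size(1), text_tokens.size(1)
        return fused[:, :n_p], fused[:, n_p:n_p + n_t], fused[:, n_p + n_t:]
```

A usage example would pass ChatGPT-generated textual concept embeddings and image region features of the same dimension through this module; because the prompts are shared, generalization learned from the broader text data can, in principle, influence the image branch.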