GTR: A Grafting-Then-Reassembling Framework for Dynamic Scene Graph Generation

Jiafeng Liang, Yuxin Wang, Zekun Wang, Ming Liu, Ruiji Fu, Zhongyuan Wang, Bing Qin

Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence
Main Track. Pages 1177-1185. https://doi.org/10.24963/ijcai.2023/131

Dynamic scene graph generation aims to identify visual relationships (subject-predicate-object) in video frames based on spatio-temporal contextual information. Previous work models spatial and temporal interactions implicitly and simultaneously, which entangles the spatio-temporal contextual information. To address this, we propose a Grafting-Then-Reassembling framework (GTR), which explicitly extracts intra-frame spatial information and inter-frame temporal information in two separate stages, decoupling the spatio-temporal contextual information. Specifically, we first graft a static scene graph generation model to generate static visual relationships within frames. We then propose a temporal dependency model that extracts temporal dependencies across frames and explicitly reassembles the static visual relationships into dynamic scene graphs. Experimental results show that GTR achieves state-of-the-art performance on the Action Genome dataset. Further analyses reveal that the reassembling stage is crucial to the success of our framework.
Keywords:
Computer Vision: CV: Video analysis and understanding   
Natural Language Processing: NLP: Information extraction
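
The abstract describes a two-stage pipeline: a grafted static scene graph generation model produces per-frame relationship predictions, and a temporal dependency model then reassembles them across frames. The sketch below illustrates that decoupled structure only; all module names, tensor shapes, the toy dimensions, and the choice of a Transformer encoder for the temporal dependency model are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of a grafting-then-reassembling pipeline.
# Everything here is hypothetical scaffolding, not the GTR codebase.
import torch
import torch.nn as nn

class StaticSGG(nn.Module):
    """Stage 1 (grafting): stand-in for any off-the-shelf static scene
    graph generation model that scores predicates within a single frame."""
    def __init__(self, feat_dim: int, num_predicates: int):
        super().__init__()
        self.classifier = nn.Linear(2 * feat_dim, num_predicates)

    def forward(self, subj_feats, obj_feats):
        # subj_feats, obj_feats: (num_pairs, feat_dim) for one frame
        pair = torch.cat([subj_feats, obj_feats], dim=-1)
        return self.classifier(pair)  # per-frame predicate logits

class TemporalDependencyModel(nn.Module):
    """Stage 2 (reassembling): models dependencies of the same
    subject-object pair across frames; a Transformer encoder is one
    plausible choice (assumed here)."""
    def __init__(self, num_predicates: int, num_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=num_predicates, nhead=1, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, static_logits):
        # static_logits: (num_pairs, num_frames, num_predicates)
        return self.encoder(static_logits)  # temporally refined logits

if __name__ == "__main__":
    # Toy sizes chosen arbitrarily for the demo.
    feat_dim, num_predicates, num_frames, num_pairs = 256, 26, 8, 4
    static_model = StaticSGG(feat_dim, num_predicates)
    temporal_model = TemporalDependencyModel(num_predicates)

    # Stage 1: generate static visual relationships frame by frame.
    per_frame = [
        static_model(torch.randn(num_pairs, feat_dim),
                     torch.randn(num_pairs, feat_dim))
        for _ in range(num_frames)
    ]
    static_logits = torch.stack(per_frame, dim=1)  # (pairs, frames, preds)

    # Stage 2: reassemble static predictions into dynamic scene graphs.
    dynamic_logits = temporal_model(static_logits)
    print(dynamic_logits.shape)  # torch.Size([4, 8, 26])
```

The key point the sketch captures is the decoupling: stage 1 never sees other frames, and stage 2 operates only on per-frame relationship predictions, so spatial and temporal context are extracted separately rather than modeled simultaneously.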