Multimodal Transformer Networks for Pedestrian Trajectory Prediction

Ziyi Yin, Ruijin Liu, Zhiliang Xiong, Zejian Yuan

Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence
Main Track. Pages 1259-1265. https://doi.org/10.24963/ijcai.2021/174

We consider the problem of forecasting the future locations of pedestrians in the ego-centric view of a moving vehicle. Existing CNN- and RNN-based methods struggle to capture the highly dynamic motion between pedestrians and the ego-vehicle, and they require massive numbers of parameters because they learn long-term temporal dependencies inefficiently. To address these issues, we propose an efficient multimodal transformer network that aggregates trajectory and ego-vehicle speed variations at a coarse granularity and interacts with optical flow at a fine granularity to recover the highly dynamic motion. Specifically, a coarse-grained fusion stage fuses information between the trajectory and ego-vehicle speed modalities to capture general temporal consistency, while a fine-grained fusion stage merges optical flow from the center area and the pedestrian area to compensate for the highly dynamic motion of the ego-vehicle and the target pedestrian. Moreover, the whole network is purely attention-based, so it can efficiently model long-term sequences and better capture temporal variations. Our multimodal transformer is validated on the PIE and JAAD datasets and achieves state-of-the-art performance with the most lightweight model size. The code is available at https://github.com/ericyinyzy/MTN_trajectory.
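To make the two-stage design concrete, the following is a minimal PyTorch sketch of the coarse-grained and fine-grained fusion stages described above. It is an illustrative reconstruction under stated assumptions, not the authors' implementation (see the linked repository for the official code): the class name TwoStageFusionSketch, the feature dimensions, the 15-frame observation / 45-frame prediction horizons, and the pooled optical-flow input format are all assumptions.

# Illustrative sketch only; not the official MTN code.
import torch
import torch.nn as nn

class TwoStageFusionSketch(nn.Module):
    def __init__(self, d_model=64, n_heads=4, obs_len=15, pred_len=45):
        super().__init__()
        self.pred_len = pred_len
        # Per-modality embeddings: bounding-box trajectory (4-d) and
        # ego-vehicle speed (1-d) per observed frame.
        self.traj_embed = nn.Linear(4, d_model)
        self.speed_embed = nn.Linear(1, d_model)
        self.pos = nn.Parameter(torch.zeros(1, obs_len, d_model))
        # Coarse-grained stage: self-attention over the concatenated
        # trajectory and speed token sequences.
        coarse_layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=128, batch_first=True)
        self.coarse = nn.TransformerEncoder(coarse_layer, num_layers=2)
        # Fine-grained stage: trajectory tokens cross-attend to optical-flow
        # features pooled from the center and pedestrian areas (assumed
        # preprocessed into per-frame (u, v) vectors per area).
        self.flow_embed = nn.Linear(2, d_model)
        self.fine = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Regress future bounding-box offsets from the fused representation.
        self.head = nn.Linear(d_model, pred_len * 4)

    def forward(self, traj, speed, flow):
        # traj: (B, T, 4) observed boxes; speed: (B, T, 1) ego speed;
        # flow: (B, T, 2, 2) mean (u, v) flow over {center, pedestrian} areas.
        t = self.traj_embed(traj) + self.pos
        s = self.speed_embed(speed) + self.pos
        fused = self.coarse(torch.cat([t, s], dim=1))   # (B, 2T, D)
        q = fused[:, : traj.size(1)]                    # trajectory tokens
        f = self.flow_embed(flow.flatten(1, 2))         # (B, 2T, D)
        fine, _ = self.fine(q, f, f)                    # cross-attention
        out = self.head(fine.mean(dim=1))               # (B, pred_len * 4)
        return out.view(-1, self.pred_len, 4)

if __name__ == "__main__":
    model = TwoStageFusionSketch()
    traj = torch.randn(2, 15, 4)
    speed = torch.randn(2, 15, 1)
    flow = torch.randn(2, 15, 2, 2)
    print(model(traj, speed, flow).shape)  # torch.Size([2, 45, 4])

The sketch mirrors the abstract's structure: the coarse stage models temporal consistency across the trajectory and ego-speed tokens with self-attention, and the fine stage injects the highly dynamic motion by cross-attending to the area-pooled optical flow; no convolutional or recurrent layers are used.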
Keywords:
Computer Vision: 2D and 3D Computer Vision
Computer Vision: Structural and Model-Based Approaches, Knowledge Representation and Reasoning
Multidisciplinary Topics and Applications: Transportation