E3SN: Efficient End-to-End Siamese Network for Video Object Segmentation

Meng Lan, Yipeng Zhang, Qinning Xu, Lefei Zhang

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence
Main track. Pages 701-707. https://doi.org/10.24963/ijcai.2020/98

In semi-supervised video object segmentation (VOS), SiamMask has achieved competitive accuracy and the fastest running speed. However, its two-stage training procedure requires additional manual intervention, and its use of only single-level features does not fully exploit the rich hierarchical feature information. This paper proposes an efficient end-to-end Siamese network for VOS. In particular, a supervised sampling strategy is designed to optimize the training procedure, which allows the entire model to be trained in an end-to-end manner. Moreover, a multilevel feature aggregation module is developed to enhance feature representation and improve segmentation accuracy. Experimental results on the DAVIS2016 and DAVIS2017 datasets show that the proposed approach outperforms SiamMask in accuracy at a similar FPS. The approach also achieves a good accuracy-speed trade-off compared with other state-of-the-art VOS algorithms.
Keywords:
Computer Vision: Motion and Tracking
Computer Vision: Recognition: Detection, Categorization, Indexing, Matching, Retrieval, Semantic Interpretation