D3ETR: Decoder Distillation for Detection Transformer

Xiaokang Chen, Jiahui Chen, Yan Liu, Jiaxiang Tang, Gang Zeng

Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence
Main Track. Pages 668-676. https://doi.org/10.24963/ijcai.2024/74

Although various knowledge distillation (KD) methods for CNN-based detectors have been proven effective in improving small students, building baselines and recipes for DETR-based detectors remains a challenge. This paper concentrates on the transformer decoder of DETR-based detectors and explores KD methods suitable for them. However, the random order of the decoder outputs poses a challenge for knowledge distillation as it provides no direct correspondence between the predictions of the teacher and the student. To this end, we propose MixMatcher that aligns the decoder outputs of DETR-based teacher and student, by mixing two teacher-student matching strategies for combined advantages. The first strategy, Adaptive Matching, applies bipartite matching to adaptively match the outputs of the teacher and the student in each decoder layer. The second strategy, Fixed Matching, fixes the correspondence between the outputs of the teacher and the student with the same object queries as input, which alleviates instability of bipartite matching in Adaptive Matching. Using both strategies together produces better results than using either strategy alone. Based on MixMatcher, we devise Decoder Distillation for DEtection TRansformer (D3ETR), which distills knowledge in decoder predictions and attention maps from the teacher to student. D3ETR shows superior performance on various DETR-based detectors with different backbones. For instance, D3ETR improves Conditional DETR-R50-C5 by 8.3 mAP under 12 epochs training setting with Conditional DETR-R101-C5 serving as the teacher. The code will be released.
Keywords:
Computer Vision: CV: Applications
Computer Vision: CV: Recognition (object detection, categorization)
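
The two matching strategies named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`l1_cost`, `adaptive_match`, `fixed_match`) are hypothetical, the matching cost is assumed to be a plain L1 distance between predicted boxes, and the optimal bipartite assignment is found by exhaustive search over permutations, which is only practical for a handful of queries.

```python
from itertools import permutations

def l1_cost(a, b):
    # Assumed matching cost: L1 distance between two boxes (cx, cy, w, h).
    return sum(abs(x - y) for x, y in zip(a, b))

def adaptive_match(teacher_boxes, student_boxes):
    # Bipartite matching by brute force: pick the one-to-one assignment
    # with the lowest total cost. Returns a list m where m[i] is the
    # teacher index matched to student output i.
    n = len(student_boxes)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        cost = sum(l1_cost(student_boxes[i], teacher_boxes[perm[i]])
                   for i in range(n))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return list(best_perm)

def fixed_match(num_queries):
    # When teacher and student decode the same object queries, output i
    # of one is paired with output i of the other (identity mapping).
    return list(range(num_queries))

# Toy decoder outputs: the student predicts the same objects as the
# teacher, but in a different order.
teacher = [[0.10, 0.10, 0.20, 0.20],
           [0.50, 0.50, 0.30, 0.30],
           [0.80, 0.80, 0.10, 0.10]]
student = [[0.52, 0.50, 0.30, 0.28],
           [0.79, 0.80, 0.10, 0.12],
           [0.10, 0.12, 0.20, 0.20]]

adaptive = adaptive_match(teacher, student)  # recovers the reordering
fixed = fixed_match(len(teacher))            # index-to-index pairing
```

In this sketch the adaptive assignment depends on the current predictions and can change as the student's outputs change, while the fixed assignment is independent of the predictions, which is the stability property the abstract attributes to Fixed Matching.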