Better and Faster: Knowledge Transfer from Multiple Self-supervised Learning Tasks via Graph Distillation for Video Classification
Chenrui Zhang, Yuxin Peng
Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence
Main track. Pages 1135-1141.
https://doi.org/10.24963/ijcai.2018/158
Video representation learning is a vital problem for
the classification task. Recently, a promising unsupervised
paradigm termed self-supervised learning
has emerged, which explores inherent supervisory
signals implied in massive data for feature learning
by solving auxiliary tasks. However, existing
methods in this regard suffer from two limitations
when extended to video classification. First, they
focus on only a single task, ignoring the complementarity
among different task-specific features
and thus yielding suboptimal video representations.
Second, their high computational and memory costs
hinder their application in real-world scenarios. In
this paper, we propose a graph-based distillation
framework to address these problems: (1) We propose
a logits graph and a representation graph to transfer
knowledge from multiple self-supervised tasks,
where the former distills classifier-level knowledge
by solving a multi-distribution joint matching problem,
and the latter distills internal feature knowledge
from pairwise ensembled representations while
tackling the challenge of heterogeneity among different
features; (2) By adopting a teacher-student framework,
the proposal dramatically reduces the redundancy
of knowledge learned from the teachers,
leading to a lighter student model that solves
the classification task more efficiently. Experimental
results on 3 video datasets validate that our proposal
not only helps learn better video representations
but also compresses the model for faster inference.
Keywords:
Computer Vision: Recognition: Detection, Categorization, Indexing, Matching, Retrieval, Semantic Interpretation
Computer Vision: Structural and Model-Based Approaches, Knowledge Representation and Reasoning