Balancing Multimodal Learning via Online Logit Modulation

Daoming Zong, Chaoyue Ding, Baoxiang Li, Jiakui Li, Ken Zheng

Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence
Main Track. Pages 5753-5761. https://doi.org/10.24963/ijcai.2024/636

Multimodal learning is provably superior to unimodal learning. In practice, however, the best-performing unimodal networks often outperform jointly trained multimodal networks. This phenomenon can be attributed to the differing convergence and generalization rates of the modalities: under simple multimodal joint training, one modality comes to dominate while the others remain underfitted. To mitigate this issue, we propose two key ingredients: i) disentangling the learning of unimodal features from multimodal interaction through an intermediate representation fusion block; ii) modulating the logits of different modalities with dynamic coefficients during training so that their magnitudes align with the target values, referred to as online logit modulation (OLM). Remarkably, OLM is model-agnostic and can be seamlessly integrated into most existing multimodal training frameworks. Empirical evidence shows that our approach brings significant gains over baselines on a wide range of multimodal tasks covering video, audio, text, image, and depth modalities.
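The abstract only outlines the mechanism, so the following is a minimal PyTorch sketch of the general idea rather than the paper's actual formulation: each modality keeps its own classification head, and during training each modality's logits are rescaled by a dynamic, non-learned coefficient so that no single modality dominates the fused prediction. The specific coefficient rule used here (ratio of a common target magnitude to the current ground-truth-class logit), the two-modality setup, and names such as `audio_enc`, `visual_enc`, and `target_scale` are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn


class OLMFusion(nn.Module):
    """Illustrative sketch of logit-level fusion with online modulation.

    NOTE: The coefficient rule below is an assumption for illustration; the
    paper's intermediate representation fusion block and exact OLM formula
    are not reproduced here.
    """

    def __init__(self, audio_enc, visual_enc, feat_dim, num_classes, target_scale=1.0):
        super().__init__()
        self.audio_enc = audio_enc      # unimodal feature extractor (assumed interface)
        self.visual_enc = visual_enc    # unimodal feature extractor (assumed interface)
        self.audio_head = nn.Linear(feat_dim, num_classes)
        self.visual_head = nn.Linear(feat_dim, num_classes)
        self.target_scale = target_scale  # common target magnitude (hypothetical)

    def forward(self, audio, visual, labels=None):
        # Unimodal features are learned by separate branches; interaction
        # happens only at the logit level.
        logit_a = self.audio_head(self.audio_enc(audio))
        logit_v = self.visual_head(self.visual_enc(visual))

        if self.training and labels is not None:
            # Dynamic coefficients (no gradients flow through them): compare
            # each modality's average ground-truth-class logit with a common
            # target magnitude and rescale accordingly.
            gt_a = logit_a.gather(1, labels.unsqueeze(1)).abs().mean().detach()
            gt_v = logit_v.gather(1, labels.unsqueeze(1)).abs().mean().detach()
            coef_a = self.target_scale / gt_a.clamp(min=1e-6)
            coef_v = self.target_scale / gt_v.clamp(min=1e-6)
            logit_a, logit_v = coef_a * logit_a, coef_v * logit_v

        # Fuse the (modulated) unimodal logits into a joint prediction.
        return logit_a + logit_v
```

In such a setup the fused logits would typically be trained with a standard cross-entropy loss, and because the coefficients are detached scalars recomputed every step, the scheme can in principle be dropped into most joint-training pipelines without architectural changes.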
Keywords:
Machine Learning: ML: Optimization
Computer Vision: CV: Multimodal learning
Machine Learning: ML: Applications
Machine Learning: ML: Attention models