The Orthogonality of Weight Vectors: The Key Characteristics of Normalization and Residual Connections

Zhixing Lu, Yuanyuan Sun, Zhihao Yang, Qin Zhou, Hongfei Lin

Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence
Main Track. Pages 4687-4695.

Normalization and residual connections are widely used in deep neural networks and contribute substantially to their performance, yet the precise reasons for this improvement have remained unclear. Our theoretical analysis shows that normalization and residual connections enhance the orthogonality of the weight vectors of deep neural networks. This, in turn, makes the Gram matrix of the network weights tend toward strict diagonal dominance, which strengthens the network's capacity for feature learning. We also design the parameters independence index (PII) to precisely characterize the orthogonality of parameter vectors. To complement the theory, we validate these findings empirically on widely used network models, including fully connected networks (FNNs), convolutional neural networks (CNNs), Transformers, pre-trained language models (PLMs), and large language models (LLMs) composed of Transformers. Finally, we find that a fine-tuning technique (LoRA) preserves the orthogonality of parameter vectors, a result relevant to fine-tuning techniques for LLMs.
Machine Learning: ML: Explainable/Interpretable machine learning
Machine Learning: ML: Deep learning architectures
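
The link the abstract draws between orthogonal weight vectors and a strictly diagonally dominant Gram matrix can be illustrated with a minimal sketch. The abstract does not define the PII, so the code below does not compute it; the function names and the mean absolute cosine similarity are illustrative assumptions, not the authors' method. Each weight vector is taken to be one row of a weight matrix, and the sketch assumes at least two rows, none of them zero.

```python
import math


def gram_matrix(rows):
    """Gram matrix G with G[i][j] = <rows[i], rows[j]>."""
    return [[sum(a * b for a, b in zip(u, v)) for v in rows] for u in rows]


def is_strictly_diagonally_dominant(g):
    """True if |G[i][i]| > sum over j != i of |G[i][j]| for every row i."""
    for i, row in enumerate(g):
        off_diag = sum(abs(x) for j, x in enumerate(row) if j != i)
        if abs(row[i]) <= off_diag:
            return False
    return True


def mean_abs_cosine(rows):
    """Mean absolute cosine similarity over distinct row pairs.

    Illustrative stand-in for an orthogonality score (NOT the paper's PII):
    0.0 means mutually orthogonal rows, 1.0 means all rows are parallel.
    """
    g = gram_matrix(rows)
    n = len(rows)
    norms = [math.sqrt(g[i][i]) for i in range(n)]
    values = [
        abs(g[i][j]) / (norms[i] * norms[j])
        for i in range(n)
        for j in range(n)
        if i != j
    ]
    return sum(values) / len(values)


# Mutually orthogonal weight vectors: the Gram matrix is diagonal.
orthogonal = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
# Nearly parallel weight vectors: off-diagonal entries rival the diagonal.
near_parallel = [[1.0, 0.0], [1.0, 0.1]]

print(is_strictly_diagonally_dominant(gram_matrix(orthogonal)))     # True
print(is_strictly_diagonally_dominant(gram_matrix(near_parallel)))  # False
```

Exactly orthogonal rows give a diagonal Gram matrix, which is strictly diagonally dominant whenever no row is zero; as rows become more correlated, the off-diagonal inner products grow until dominance fails.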