Generalized Clustergrams for Overlapping Biclusters

Many real-life datasets, such as those produced by gene expression studies, exhibit complex substructures at various levels of granularity and thus do not have unique well-defined numbers of clusters. In such cases, it is important to be able to trace the evolution of the individual clusters as the number of dimensions of the clustering is varied. While the dendrograms produced by bottom-up clustering methods such as hierarchical clustering are very useful for this purpose, the approach is known to produce unreliable clusters due to its instability w.r.t. resampling. Moreover, hierarchical clustering does not apply to overlapping (bi)clusters, such as those obtained in gene expression studies. On the other hand, the instability w.r.t. the initialization of top-down methods, such as k-means, prevents the comparison between clusters obtained at different dimensionalities. In this paper, we present a method for constructing generalized dendrograms for overlapping biclusters, which depict the evolution of the biclusters as their number is varied. An essential ingredient is a stable biclustering method based on positive tensor factorization of a number of nonnegative matrix factorization runs. We apply our approach to a large colon cancer dataset, which shows several distinct subclasses whose dimensional evolution must be carefully analyzed to enable a more meaningful biological interpretation and sub-classification.

Liviu Badea