The International Workshop on Data-Mining and Statistical Science

September 25-26, 2006, Century Royal Hotel, Sapporo, Japan

Unsupervised Learning of n << p Data and Its Applications to Bioinformatics

Ryo Yoshida, Seiya Imoto, Tomoyuki Higuchi, Satoru Miyano

Microarray gene expression data typically have a small sample size, usually fewer than one hundred, whereas the dimension of the data exceeds several thousand. In such a setting, model-based clustering with a conventional finite mixture distribution can fail because the density estimate overfits. In this paper, we address the problem by extending classical factor analysis to mixed factors analysis. The mixed factors model that we propose is defined by an observation equation that includes low-dimensional mixed factors acting as a blind source of the clusters. This modeling offers a parsimonious parameterization of the Gaussian mixture distribution that the high-dimensional data are assumed to follow. In this way, we can avoid overfitting even when the number of samples is much smaller than the dimension of the data. The effectiveness of mixed factors analysis is demonstrated through applications to real gene expression datasets.
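To make the parsimony argument concrete, the following minimal sketch compares free-parameter counts for an unrestricted Gaussian mixture and for a mixed factors model. The abstract does not state the model equations, so the form assumed here is an illustration only: an observation equation x = A f + e with a p x q loading matrix A, diagonal noise covariance, and a g-component Gaussian mixture on the q-dimensional factors f. The function names and the example sizes (p = 2000, q = 2, g = 3) are hypothetical, and identifiability constraints on A are ignored, so the counts are rough upper bounds rather than the authors' exact figures.

```python
def full_gmm_param_count(p: int, g: int) -> int:
    """Free parameters of a g-component Gaussian mixture in p dimensions
    with unrestricted (full) covariance matrices."""
    mixing = g - 1                      # mixing proportions sum to one
    means = g * p                       # one p-dimensional mean per component
    covariances = g * p * (p + 1) // 2  # one symmetric p x p matrix per component
    return mixing + means + covariances


def mixed_factors_param_count(p: int, q: int, g: int) -> int:
    """Free parameters under the ASSUMED form x = A f + e, where f follows a
    g-component Gaussian mixture in q dimensions and e has diagonal covariance.
    Rotation/scale constraints on A are ignored (assumption)."""
    loadings = p * q                    # p x q loading matrix A
    noise = p                           # diagonal noise variances
    factor_mixture = full_gmm_param_count(q, g)  # mixture in the factor space
    return loadings + noise + factor_mixture


if __name__ == "__main__":
    p, q, g = 2000, 2, 3  # hypothetical sizes, typical of the n << p setting
    print("full mixture :", full_gmm_param_count(p, g))
    print("mixed factors:", mixed_factors_param_count(p, q, g))
```

Under these assumptions the unrestricted mixture needs on the order of g * p^2 / 2 parameters, while the factor-based parameterization grows only linearly in p, which is why the latter remains estimable when the sample size is below one hundred.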