The International Workshop on Data-Mining and Statistical Science,
September 25-26, 2006, Century Royal Hotel, Sapporo, Japan

DMSS2006 invited talk: Ryo Yoshida

Monday, September 25, 2006


Unsupervised Learning of n << p Data and Its Applications to Bioinformatics


Ryo Yoshida, Seiya Imoto, Tomoyuki Higuchi, Satoru Miyano


Microarray gene expression data typically have a small sample size, usually fewer than one hundred, whereas the dimension of the data exceeds several thousand. In this setting, model-based clustering with a conventional finite mixture distribution can fail because the density estimate overfits. In this paper, we address the problem by extending classical factor analysis to mixed factors analysis. The mixed factors model that we propose is defined by an observation equation that includes low-dimensional mixed factors, which act as a blind source of the clusters. This modeling yields a parsimonious parameterization of the Gaussian mixture distribution that the high-dimensional data are assumed to follow. In this way, we can avoid overfitting even when the number of samples is much smaller than the dimension of the data. The effectiveness of mixed factors analysis is demonstrated through applications to real gene expression datasets.
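
To give a rough sense of the parsimony argument, the following sketch compares free-parameter counts for a full-covariance Gaussian mixture and for a factor-type model. It is an illustration only, not the authors' implementation. The abstract does not state the model's exact form, so the sketch assumes an observation equation x = A f + e with a p-by-q loading matrix A, factors f drawn from a g-component Gaussian mixture in q dimensions, and diagonal Gaussian noise e. The function names and the values of p, q, and g are hypothetical.

```python
def full_gmm_param_count(p, g):
    """Free parameters of a g-component Gaussian mixture with full covariances in p dimensions."""
    weights = g - 1                      # mixing proportions sum to one
    means = g * p                        # one p-vector per component
    covs = g * p * (p + 1) // 2          # one symmetric p x p matrix per component
    return weights + means + covs


def mixed_factors_param_count(p, q, g):
    """Parameter count under the ASSUMED form x = A f + e (see the lead-in).

    Identifiability constraints on A are ignored, so this is an upper bound.
    """
    loadings = p * q                     # p x q loading matrix A
    noise = p                            # diagonal noise variances
    weights = g - 1                      # mixing proportions of the factor mixture
    means = g * q                        # component means in the q-dimensional factor space
    covs = g * q * (q + 1) // 2          # component covariances in the factor space
    return loadings + noise + weights + means + covs


# Hypothetical sizes in the n << p regime described above.
p, q, g = 5000, 3, 4
full = full_gmm_param_count(p, g)
mixed = mixed_factors_param_count(p, q, g)
print(f"full-covariance mixture: {full} parameters")
print(f"factor-type model:       {mixed} parameters")
```

Under these assumed sizes, the full-covariance mixture needs on the order of tens of millions of parameters, while the factor-type model needs on the order of tens of thousands, because the cluster structure is confined to the q-dimensional factor space.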

Home page:

Ryo Yoshida's Home page at IMS