  # Statistical Modeling by the MDL Principle

J. Rissanen
IBM Research Division
Almaden Research Center, DPE-B2/802
San Jose, CA 95120-6099, [E-mail deleted]

The current interpretation of the MDL principle is to represent a class of models, whether given by probability distributions or induced by prediction-error criteria, by a single universal model, which enables us to encode the observed data with a code length that is as short as possible. This minimum code length is called the stochastic complexity. The universal model allows us to decompose the data into a learnable, information-bearing part, defined by the optimal model, and the remainder, which is just noise carrying no useful information describable by the models in the suggested model class. We call the result a universal sufficient statistics decomposition, in analogy with the Kolmogorov sufficient statistics decomposition in the algorithmic theory of information. There are several ways to construct universal models, of which we discuss in detail only one, the so-called Normalized Maximum Likelihood (NML) model. For the linear least squares regression problem the NML model, with its universal sufficient statistics decomposition, can be calculated exactly by a three-fold normalization process. In the important special case of the denoising problem, this identifies the noise to be removed as the part of the data that cannot be compressed with the suggested models, while the information-bearing and hence learnable signal is the compressible part. As a numerical example of denoising we process speech data with wavelets.
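The NML construction mentioned above can be made explicit. As a sketch following the standard definition (the symbols below are generic notation, not taken from this abstract): for a model class $\{f(x^n;\theta)\}$ with maximum likelihood estimate $\hat\theta(x^n)$, the NML model is the maximized likelihood normalized over all data sequences,

```latex
\hat{f}(x^n) \;=\; \frac{f\!\bigl(x^n;\hat{\theta}(x^n)\bigr)}{C_n},
\qquad
C_n \;=\; \int f\!\bigl(y^n;\hat{\theta}(y^n)\bigr)\, dy^n ,
```

so that the stochastic complexity of the data relative to the class is the ideal code length

```latex
-\log \hat{f}(x^n)
\;=\;
-\log f\!\bigl(x^n;\hat{\theta}(x^n)\bigr) \;+\; \log C_n ,
```

where the first term measures how well the best model in the class fits the data and the second, the normalizing constant $\log C_n$, is the parametric complexity of the class.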

\$BLa\$k(B
