The Poisson transform for unnormalised statistical models


Statistics and Computing

July 2015, vol. 25, n°4, pp.767-780

Departments: Economics & Decision Sciences

Contrary to standard statistical models, unnormalised statistical models only specify the likelihood function up to a constant. While such models are natural and popular, the lack of normalisation makes inference much more difficult. Extending classical results on the multinomial-Poisson transform (Baker, J Royal Stat Soc 43(4):495–504, 1994), we show that inferring the parameters of an unnormalised model on a space Ω can be mapped onto an equivalent problem of estimating the intensity of a Poisson point process on Ω. The unnormalised statistical model now specifies an intensity function that does not need to be normalised. Effectively, the normalisation constant may now be inferred as just another parameter, at no loss of information. The result can be extended to cover non-IID models, which include, for example, unnormalised models for sequences of graphs (dynamical graphs) or for sequences of binary vectors. As a consequence, we prove that unnormalised parametric inference in non-IID models can be turned into a semi-parametric estimation problem. Moreover, we show that the noise-contrastive estimation method of Gutmann and Hyvärinen (J Mach Learn Res 13(1):307–361, 2012) can be understood as an approximation of the Poisson transform, and extended to non-IID settings. We use our results to fit spatial Markov chain models of eye movements, where the Poisson transform allows us to turn a highly non-standard model into vanilla semi-parametric logistic regression.

Unnormalised statistical models are a core tool in modern machine learning, especially deep learning (Salakhutdinov and Hinton 2009) and computer vision (Markov random fields, Wang et al. 2013), as well as in statistics for point processes (Gu and Zhu 2001), network models (Caimo and Friel 2011) and directional data (Walker 2011).
They appear naturally whenever one can best describe data as having to conform to certain features: we may then define an energy function that measures how well the data conform to these constraints. While this way of formulating statistical models is extremely general and useful, immense technical difficulties may arise whenever the energy function involves some unknown parameters which have to be estimated from data. The reason is that the normalisation constant (which ensures that the distribution integrates to one) is in most cases impossible to compute. This prevents direct application of classical methods of maximum likelihood or Bayesian inference, which all depend on the unknown normalisation constant.

Many techniques have been developed in recent years for such problems, including contrastive divergence (Hinton 2002; Bengio and Delalleau 2009), noise-contrastive estimation (Gutmann and Hyvärinen 2012) and various forms of MCMC for Bayesian inference (Møller et al. 2006; Murray et al. 2006; Girolami et al. 2013). The difficulty is compounded when unnormalised models are used for non-IID data, either sequential data, or data that include covariates. If the data form a sequence of length n, there are now n normalisation constants to approximate. In our application we look at models of spatial Markov chains, where the transition density of the chain is specified up to a normalisation constant, and again one normalisation constant needs to be estimated per observation.

In the first section, we show that unnormalised estimation is tightly related to the estimation of point process intensities, and formulate a Poisson transform that maps the log-likelihood of a model L(θ) into an equivalent cost function M(θ, ν) defined in an expanded space, where the latent variables ν effectively estimate the normalisation constants.
In the case of non-IID unnormalised models we show further that optimisation of M(θ, ν) can be turned into a semi-parametric problem and addressed using standard kernel methods. In the second section, we show that the noise-contrastive divergence described in Gutmann and Hyvärinen (2012) arises naturally as a tractable approximation of the Poisson transform, and that this new interpretation lets us extend its use to non-IID models. (Gutmann and Hyvärinen (2012) call the technique “noise-contrastive estimation”, but we use the term noise-contrastive divergence to designate the corresponding cost function.) Finally, we apply these results to a class of unnormalised spatial Markov chains that are natural descriptions of eye movement sequences.
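
To make the idea of inferring the normalisation constant as "just another parameter" concrete, the following is a minimal numerical sketch for the IID case. It rests on several assumptions that are not taken from the text above: the cost function is written as M(θ, ν) = Σ_i {log f_θ(y_i) + ν} − n ∫ exp{log f_θ(y) + ν} dy, the toy model is log f_θ(y) = −θ y²/2 on the real line, the integral is approximated by the trapezoidal rule on a truncated grid, and all function and variable names are hypothetical. Under these assumptions, maximising M jointly over (θ, ν) should recover the usual maximum-likelihood estimate of θ, with ν approximately equal to minus the log normalisation constant.

```python
import numpy as np

def log_f(y, theta):
    """Toy unnormalised log-density (an assumption for illustration): -theta * y^2 / 2."""
    return -0.5 * theta * y**2

# Quadrature grid standing in for Omega; we assume the mass outside [-10, 10] is negligible.
grid = np.linspace(-10.0, 10.0, 4001)
weights = np.full(grid.size, grid[1] - grid[0])
weights[[0, -1]] *= 0.5  # trapezoidal rule: half weight at the two end points

def poisson_objective(params, y):
    """M(theta, nu) = sum_i {log f(y_i) + nu} - n * integral of exp{log f(y) + nu}."""
    theta, nu = params
    integral = weights @ np.exp(log_f(grid, theta) + nu)
    return np.sum(log_f(y, theta) + nu) - y.size * integral

def fit_poisson_transform(y, n_iter=50):
    """Maximise M jointly over (theta, nu) with a damped Newton method."""
    n = y.size
    # The log-intensity is linear in (theta, nu): theta * (-y^2 / 2) + nu * 1.
    feats_data = np.column_stack([-0.5 * y**2, np.ones(n)])
    feats_grid = np.column_stack([-0.5 * grid**2, np.ones(grid.size)])
    params = np.array([1.0, 0.0])  # arbitrary starting point
    for _ in range(n_iter):
        mass = weights * np.exp(feats_grid @ params)  # intensity times quadrature weight
        grad = feats_data.sum(axis=0) - n * (mass @ feats_grid)
        hess = -n * (feats_grid.T * mass) @ feats_grid  # negative definite: M is concave
        step = np.linalg.solve(hess, -grad)
        current = poisson_objective(params, y)
        slack = 1e-9 * (1.0 + abs(current))  # tolerate round-off near the optimum
        t = 1.0
        while t > 1e-8 and not poisson_objective(params + t * step, y) >= current - slack:
            t *= 0.5  # halve the step until the objective no longer decreases
        params = params + t * step
        if np.linalg.norm(t * step) < 1e-10:
            break
    return params

# Simulated data: the normalised version of the toy model is N(0, 1 / theta).
rng = np.random.default_rng(0)
theta_true = 2.0
y = rng.normal(0.0, 1.0 / np.sqrt(theta_true), size=500)

theta_hat, nu_hat = fit_poisson_transform(y)
theta_mle = y.size / np.sum(y**2)              # closed-form MLE of the normalised model
log_z = 0.5 * np.log(2.0 * np.pi / theta_hat)  # exact log normalisation constant at theta_hat

print("theta from Poisson transform:", theta_hat, " closed-form MLE:", theta_mle)
print("nu:", nu_hat, " minus log Z(theta):", -log_z)
```

In this toy case the normalisation constant is known in closed form, which is what allows the comparison; the point of the sketch is only that the joint maximisation never uses it. In realistic models the integral over Ω is itself intractable, which is where an approximation such as noise-contrastive estimation would come in.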