Computer Science › Artificial Intelligence

Bayesian Methods and Mixture Models

Description

This cluster of papers focuses on the application of mixture models, particularly Gaussian finite mixture models and Dirichlet process mixture models, for model-based clustering, discriminant analysis, density estimation, and unsupervised learning. It explores various inference methods such as Bayesian inference, variational inference, and Markov Chain Monte Carlo for estimating parameters in mixture models. The cluster also delves into the challenges of identifiability, variable selection, and dealing with label switching in the context of mixture models.

Keywords

Mixture Models; Clustering; Bayesian Inference; Dirichlet Process; Gaussian Mixture Models; Variational Inference; Markov Chain Monte Carlo; Finite Mixtures; Hidden Markov Models; Nonparametric Bayesian

We describe the maximum-likelihood parameter estimation problem and how the Expectation-Maximization (EM) algorithm can be used for its solution. We first describe the abstract form of the EM algorithm as it is often given in the literature. We then develop the EM parameter estimation procedure for two applications: 1) finding the parameters of a mixture of Gaussian densities, and 2) finding the parameters of a hidden Markov model (HMM) (i.e., the Baum-Welch algorithm) for both discrete and Gaussian mixture observation models. We derive the update equations in fairly explicit detail but we do not prove any convergence properties. We try to emphasize intuition rather than mathematical rigor.
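A minimal sketch of the two EM steps for a one-dimensional Gaussian mixture may help fix ideas. It follows the generic updates described above, not the tutorial's full multivariate or Baum-Welch derivations; the initialisation and iteration count are arbitrary choices.

```python
# Minimal sketch of EM for a univariate Gaussian mixture (illustrative only).
import numpy as np

def em_gmm(x, k=2, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    w = np.full(k, 1.0 / k)                      # mixing weights
    mu = rng.choice(x, size=k, replace=False)    # component means (crude initialisation)
    var = np.full(k, np.var(x))                  # component variances
    for _ in range(n_iter):
        # E-step: responsibilities gamma[i, j] = P(component j | x_i)
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        gamma = w * dens
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: weighted maximum-likelihood updates
        nk = gamma.sum(axis=0)
        w = nk / n
        mu = (gamma * x[:, None]).sum(axis=0) / nk
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return w, mu, var

x = np.concatenate([np.random.normal(-2, 1, 300), np.random.normal(3, 0.5, 200)])
print(em_gmm(x))
```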
Abstract: The discrimination problem (two population case) may be defined as follows: the random variable Z, of observed value z, is distributed over some space (say, p-dimensional) either according to distribution F, or according to distribution G. The problem is to decide, on the basis of z, which of the two distributions Z has.
Let $x$ and $y$ be two random variables with continuous cumulative distribution functions $f$ and $g$. A statistic $U$ depending on the relative ranks of the $x$'s and $y$'s is proposed for testing the hypothesis $f = g$. Wilcoxon proposed an equivalent test in the Biometrics Bulletin, December, 1945, but gave only a few points of the distribution of his statistic. Under the hypothesis $f = g$ the probability of obtaining a given $U$ in a sample of $n$ $x$'s and $m$ $y$'s is the solution of a certain recurrence relation involving $n$ and $m$. Using this recurrence relation, tables have been computed giving the probability of $U$ for samples up to $n = m = 8$. At this point the distribution is almost normal. From the recurrence relation, explicit expressions for the mean, variance, and fourth moment are obtained. The $2r$th moment is shown to have a certain form which enabled us to prove that the limit distribution is normal if $m, n$ go to infinity in any arbitrary manner. The test is shown to be consistent with respect to the class of alternatives $f(x) > g(x)$ for every $x$.
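A small sketch of the statistic and its normal approximation, using the classical mean $nm/2$ and variance $nm(n+m+1)/12$ under $f = g$; the counting convention, tie handling, and two-sided p-value are simplifications rather than the paper's tabulated exact distribution.

```python
# Sketch: the Mann-Whitney U statistic and its large-sample normal approximation.
import numpy as np
from scipy.stats import norm

def mann_whitney_u(x, y):
    n, m = len(x), len(y)
    # U counts the pairs (x_i, y_j) with y_j < x_i (ties ignored in this sketch)
    u = sum((yj < xi) for xi in x for yj in y)
    mean_u = n * m / 2.0
    var_u = n * m * (n + m + 1) / 12.0
    z = (u - mean_u) / np.sqrt(var_u)
    return u, 2 * norm.sf(abs(z))       # two-sided p-value from the normal limit

x = np.random.normal(0.0, 1.0, 40)
y = np.random.normal(0.5, 1.0, 35)
print(mann_whitney_u(x, y))
```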
 The problem of estimating the parameters which determine a mixture density has been the subject of a large, diverse body of literature spanning nearly ninety years. During the last two decades, the method of maximum likelihood has become the most widely followed approach to this problem, thanks primarily to the advent of high speed electronic computers. Here, we first offer a brief survey of the literature directed toward this problem and review maximum-likelihood estimation for it. We then turn to the subject of ultimate interest, which is a particular iterative procedure for numerically approximating maximum-likelihood estimates for mixture density problems. This procedure, known as the EM algorithm, is a specialization to the mixture density context of a general algorithm of the same name used to approximate maximum-likelihood estimates for incomplete data problems. We discuss the formulation and theoretical and practical properties of the EM algorithm for mixture densities, focussing in particular on mixtures of densities from exponential families.
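To make the iterative procedure concrete, the generic E-step and M-step for a $K$-component mixture can be written as follows (standard textbook notation; the symbols $\pi_j$, $f_j$, $\gamma_{ij}$ are generic and not taken from the paper):

$$\gamma_{ij} \;=\; \frac{\pi_j\, f_j(x_i \mid \theta_j)}{\sum_{l=1}^{K} \pi_l\, f_l(x_i \mid \theta_l)}, \qquad \pi_j^{\text{new}} \;=\; \frac{1}{n}\sum_{i=1}^{n}\gamma_{ij}, \qquad \theta_j^{\text{new}} \;=\; \arg\max_{\theta}\; \sum_{i=1}^{n}\gamma_{ij}\,\log f_j(x_i \mid \theta).$$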
General Introduction Introduction History of Mixture Models Background to the General Classification Problem Mixture Likelihood Approach to Clustering Identifiability Likelihood Estimation for Mixture Models via EM Algorithm Start Values for EM Algorithm Properties of Likelihood Estimators for Mixture Models Information Matrix for Mixture Models Tests for the Number of Components in a Mixture Partial Classification of the Data Classification Likelihood Approach to Clustering Mixture Models with Normal Components Likelihood Estimation for a Mixture of Normal Distributions Normal Homoscedastic Components Asymptotic Relative Efficiency of the Mixture Likelihood Approach Expected and Observed Information Matrices Assessment of Normality for Component Distributions: Partially Classified Data Assessment of Typicality: Partially Classified Data Assessment of Normality and Typicality: Unclassified Data Robust Estimation for Mixture Models Applications of Mixture Models to Two-Way Data Sets Introduction Clustering of Hemophilia Data Outliers in Darwin's Data Clustering of Rare Events Latent Classes of Teaching Styles Estimation of Mixing Proportions Introduction Likelihood Estimation Discriminant Analysis Estimator Asymptotic Relative Efficiency of Discriminant Analysis Estimator Moment Estimators Minimum Distance Estimators Case Study Homogeneity of Mixing Proportions Assessing the Performance of the Mixture Likelihood Approach to Clustering Introduction Estimators of the Allocation Rates Bias Correction of the Estimated Allocation Rates Estimated Allocation Rates of Hemophilia Data Estimated Allocation Rates for Simulated Data Other Methods of Bias Corrections Bias Correction for Estimated Posterior Probabilities Partitioning of Treatment Means in ANOVA Introduction Clustering of Treatment Means by the Mixture Likelihood Approach Fitting of a Normal Mixture Model to an RCBD with Random Block Effects Some Other Methods of Partitioning Treatment Means Example 1 Example 2 Example 3 Example 4 Mixture Likelihood Approach to the Clustering of Three-Way Data Introduction Fitting a Normal Mixture Model to Three-Way Data Clustering of Soybean Data Multidimensional Scaling Approach to the Analysis of Soybean Data References Appendix
 Cluster analysis is the automated search for groups of related observations in a dataset. Most clustering done in practice is based largely on heuristic but intuitively reasonable procedures, and most clustering methods available in commercial software are also of this type. However, there is little systematic guidance associated with these methods for solving important practical questions that arise in cluster analysis, such as how many clusters are there, which clustering method should be used, and how should outliers be handled. We review a general methodology for model-based clustering that provides a principled statistical approach to these issues. We also show that this can be useful for other problems in multivariate analysis, such as discriminant analysis and multivariate density estimation. We give examples from medical diagnosis, minefield detection, cluster recovery from noisy data, and spatial density estimation. Finally, we mention limitations of the methodology and discuss recent developments in model-based clustering for non-Gaussian data, high-dimensional datasets, large datasets, and Bayesian estimation.
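As a hedged illustration of the model-based recipe (fit Gaussian mixtures under several covariance structures and component counts, then pick the model with the best criterion value), the sketch below uses scikit-learn's GaussianMixture and BIC; it is a stand-in, not the reviewed methodology or its software.

```python
# Sketch: model-based clustering with the number of components chosen by BIC.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (150, 2)), rng.normal(4, 0.7, (100, 2))])

best = None
for k in range(1, 7):
    for cov in ("full", "tied", "diag", "spherical"):   # candidate covariance structures
        gm = GaussianMixture(n_components=k, covariance_type=cov, random_state=0).fit(X)
        bic = gm.bic(X)
        if best is None or bic < best[0]:
            best = (bic, k, cov)

print("selected (BIC, k, covariance):", best)   # lowest BIC wins
```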
 This note discusses some aspects of the estimation of the density function of a univariate probability distribution. All estimates of the density function satisfying relatively mild conditions are shown to be biased. The asymptotic mean square error of a particular class of estimates is evaluated.
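For concreteness, here is a sketch of one common estimate of the kind analysed in such results, a Gaussian kernel density estimate; the kernel choice and rule-of-thumb bandwidth are my assumptions, not the note's.

```python
# Sketch: a Gaussian kernel density estimate; every such estimate is biased,
# and the bias/variance trade-off is controlled by the bandwidth h.
import numpy as np

def kde(x_grid, data, h):
    # average of Gaussian bumps centred at the observations
    u = (x_grid[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u ** 2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

data = np.random.normal(0, 1, 500)
h = 1.06 * data.std() * len(data) ** (-1 / 5)    # Silverman-style rule of thumb (assumption)
grid = np.linspace(-4, 4, 200)
print(kde(grid, data, h)[:5])
```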
 This paper proposes an unsupervised algorithm for learning a finite mixture model from multivariate data. The adjective "unsupervised" is justified by two properties of the algorithm: 1) it is capable of selecting the number of components and 2) unlike the standard expectation-maximization (EM) algorithm, it does not require careful initialization. The proposed method also avoids another drawback of EM for mixture fitting: the possibility of convergence toward a singular estimate at the boundary of the parameter space. The novelty of our approach is that we do not use a model selection criterion to choose one among a set of preestimated candidate models; instead, we seamlessly integrate estimation and model selection in a single algorithm. Our technique can be applied to any type of parametric mixture model for which it is possible to write an EM algorithm; in this paper, we illustrate it with experiments involving Gaussian mixtures. These experiments testify for the good performance of our approach.
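The following is only a caricature of the idea of starting with too many components and discarding those that collapse: it prunes on a simple weight threshold rather than the paper's minimum-message-length criterion, and it uses scikit-learn's EM as a stand-in for the paper's algorithm.

```python
# Rough sketch of "start with too many components and prune" mixture fitting.
# The actual algorithm integrates model selection into the M-step via an MML
# criterion; the fixed 0.02 weight threshold below is purely a placeholder.
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.concatenate([np.random.normal(-3, 1, 300),
                    np.random.normal(2, 0.5, 300)]).reshape(-1, 1)

k = 10                                        # deliberately too many components
while True:
    gm = GaussianMixture(n_components=k, random_state=0).fit(X)
    small = gm.weights_ < 0.02                # heuristic pruning threshold (assumption)
    if not small.any() or k == 1:
        break
    k -= int(small.sum())                     # drop near-empty components and refit

print("retained components:", k, "weights:", np.round(gm.weights_, 3))
```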
Radiocarbon dating is routinely used in paleoecology to build chronologies of lake and peat sediments, aiming at inferring a model that would relate the sediment depth with its age. We present a new approach for chronology building (called "Bacon") that has received enthusiastic attention from paleoecologists. Our methodology is based on controlling core accumulation rates using a gamma autoregressive semiparametric model with an arbitrary number of subdivisions along the sediment. Using prior knowledge about accumulation rates is crucial and informative priors are routinely used. Since many sediment cores are currently analyzed, using different data sets and prior distributions, a robust (adaptive) MCMC is very useful. We use the t-walk (Christen and Fox, 2010), a self-adjusting, robust MCMC sampling algorithm that works acceptably well in many situations. Outliers are also addressed using a recent approach that considers a Student-t model for radiocarbon data. Two examples are presented here, that of a peat core and a core from a lake, and our results are compared with other approaches.
Journal Article: Testing the number of components in a normal mixture. Yungtai Lo, Nancy R. Mendell and Donald B. Rubin. Biometrika, Volume 88, Issue 3, 1 October 2001, Pages 767–778. https://doi.org/10.1093/biomet/88.3.767
Preface. PART I: OVERVIEW AND BASIC APPROACHES. Introduction. Missing Data in Experiments. Complete-Case and Available-Case Analysis, Including Weighting Methods. Single Imputation Methods. Estimation of Imputation Uncertainty. PART II: LIKELIHOOD-BASED APPROACHES TO THE ANALYSIS OF MISSING DATA. Theory of Inference Based on the Likelihood Function. Methods Based on Factoring the Likelihood, Ignoring the Missing-Data Mechanism. Maximum Likelihood for General Patterns of Missing Data: Introduction and Theory with Ignorable Nonresponse. Large-Sample Inference Based on Maximum Likelihood Estimates. Bayes and Multiple Imputation. PART III: LIKELIHOOD-BASED APPROACHES TO THE ANALYSIS OF MISSING DATA: APPLICATIONS TO SOME COMMON MODELS. Multivariate Normal Examples, Ignoring the Missing-Data Mechanism. Models for Robust Estimation. Models for Partially Classified Contingency Tables, Ignoring the Missing-Data Mechanism. Mixed Normal and Nonnormal Data with Missing Values, Ignoring the Missing-Data Mechanism. Nonignorable Missing-Data Models. References. Author Index. Subject Index.
 We extend the jackknife and the bootstrap method of estimating standard errors to the case where the observations form a general stationary sequence. We do not attempt a reduction to i.i.d. values. The jackknife calculates the sample variance of replicates of the statistic obtained by omitting each block of $l$ consecutive data once. In the case of the arithmetic mean this is shown to be equivalent to a weighted covariance estimate of the spectral density of the observations at zero. Under appropriate conditions consistency is obtained if $l = l(n) \rightarrow \infty$ and $l(n)/n \rightarrow 0$. General statistics are approximated by an arithmetic mean. In regular cases this approximation determines the asymptotic behavior. Bootstrap replicates are constructed by selecting blocks of length $l$ randomly with replacement among the blocks of observations. The procedures are illustrated by using the sunspot numbers and some simulated data.
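A sketch of the moving-block bootstrap for the standard error of a mean, under an assumed block length; as stated above, consistency requires letting the block length grow with the sample size, which a fixed choice like this one only approximates.

```python
# Sketch: moving-block bootstrap standard error for the mean of a stationary series.
import numpy as np

def block_bootstrap_se(x, l, n_boot=1000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    blocks = np.array([x[i:i + l] for i in range(n - l + 1)])   # all overlapping blocks of length l
    k = n // l                                                  # blocks needed per resample
    means = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(blocks), size=k)              # sample blocks with replacement
        means.append(np.concatenate(blocks[idx]).mean())
    return np.std(means, ddof=1)

# AR(1)-like series as a toy example
rng = np.random.default_rng(1)
e = rng.normal(size=500)
x = np.empty(500); x[0] = e[0]
for t in range(1, 500):
    x[t] = 0.6 * x[t - 1] + e[t]
print(block_bootstrap_se(x, l=20))
```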
 Introduction Stochastic simulation Introduction Generation of Discrete Random Quantities Generation of Continuous Random Quantities Generation of Random Vectors and Matrices Resampling Methods Exercises Bayesian Inference Introduction Bayes' Theorem Conjugate Distributions Hierarchical Models Dynamic Models Spatial Models Model Comparison Exercises Approximate methods of inference Introduction Asymptotic Approximations Approximations by Gaussian Quadrature Monte Carlo Integration Methods Based on Stochastic Simulation Exercises Markov chains Introduction Definition and Transition Probabilities Decomposition of the State Space Stationary Distributions Limiting Theorems Reversible Chains Continuous State Spaces Simulation of a Markov Chain Data Augmentation or Substitution Sampling Exercises Gibbs Sampling Introduction Definition and Properties Implementation and Optimization Convergence Diagnostics Applications MCMC-Based Software for Bayesian Modeling Appendix 5.A: BUGS Code for Example 5.7 Appendix 5.B: BUGS Code for Example 5.8 Exercises Metropolis-Hastings algorithms Introduction Definition and Properties Special Cases Hybrid Algorithms Applications Exercises Further topics in MCMC Introduction Model Adequacy Model Choice: MCMC Over Model and Parameter Spaces Convergence Acceleration Exercises References Author Index Subject Index
The Bayesian approach to statistical problems, though fruitful in many ways, has been rather unsuccessful in treating nonparametric problems. This is due primarily to the difficulty in finding workable prior distributions on the parameter space, which in nonparametric problems is taken to be a set of probability distributions on a given sample space. There are two desirable properties of a prior distribution for nonparametric problems. (I) The support of the prior distribution should be large--with respect to some suitable topology on the space of probability distributions on the sample space. (II) Posterior distributions given a sample of observations from the true probability distribution should be manageable analytically. These properties are antagonistic in the sense that one may be obtained at the expense of the other. This paper presents a class of prior distributions, called Dirichlet process priors, broad in the sense of (I), for which (II) is realized, and for which treatment of many nonparametric statistical problems may be carried out, yielding results that are comparable to the classical theory. In Section 2, we review the properties of the Dirichlet distribution needed for the description of the Dirichlet process given in Section 3. Briefly, this process may be described as follows. Let $\mathscr{X}$ be a space and $\mathscr{A}$ a $\sigma$-field of subsets, and let $\alpha$ be a finite non-null measure on $(\mathscr{X}, \mathscr{A})$. Then a stochastic process $P$ indexed by elements $A$ of $\mathscr{A}$, is said to be a Dirichlet process on $(\mathscr{X}, \mathscr{A})$ with parameter $\alpha$ if for any measurable partition $(A_1, \cdots, A_k)$ of $\mathscr{X}$, the random vector $(P(A_1), \cdots, P(A_k))$ has a Dirichlet distribution with parameter $(\alpha(A_1), \cdots, \alpha(A_k))$. $P$ may be considered a random probability measure on $(\mathscr{X}, \mathscr{A})$. The main theorem states that if $P$ is a Dirichlet process on $(\mathscr{X}, \mathscr{A})$ with parameter $\alpha$, and if $X_1, \cdots, X_n$ is a sample from $P$, then the posterior distribution of $P$ given $X_1, \cdots, X_n$ is also a Dirichlet process on $(\mathscr{X}, \mathscr{A})$ with a parameter $\alpha + \sum^n_1 \delta_{x_i}$, where $\delta_x$ denotes the measure giving mass one to the point $x$. In Section 4, an alternative definition of the Dirichlet process is given. This definition exhibits a version of the Dirichlet process that gives probability one to the set of discrete probability measures on $(\mathscr{X}, \mathscr{A})$. This is in contrast to Dubins and Freedman [2], whose methods for choosing a distribution function on the interval [0, 1] lead with probability one to singular continuous distributions. Methods of choosing a distribution function on [0, 1] that with probability one is absolutely continuous have been described by Kraft [7]. The general method of choosing a distribution function on [0, 1], described in Section 2 of Kraft and van Eeden [10], can of course be used to define the Dirichlet process on [0, 1]. Special mention must be made of the papers of Freedman and Fabius. Freedman [5] defines a notion of tailfree for a distribution on the set of all probability measures on a countable space $\mathscr{X}$. For a tailfree prior, the posterior distribution given a sample from the true probability measure may be fairly easily computed.
Fabius [3] extends the notion of tailfree to the case where $\mathscr{X}$ is the unit interval [0, 1], but it is clear his extension may be made to cover quite general $\mathscr{X}$. With such an extension, the Dirichlet process would be a special case of a tailfree distribution for which the posterior distribution has a particularly simple form. There are disadvantages to the fact that $P$ chosen by a Dirichlet process is discrete with probability one. These appear mainly because in sampling from a $P$ chosen by a Dirichlet process, we expect eventually to see one observation exactly equal to another. For example, consider the goodness-of-fit problem of testing the hypothesis $H_0$ that a distribution on the interval [0, 1] is uniform. If on the alternative hypothesis we place a Dirichlet process prior with parameter $\alpha$ itself a uniform measure on [0, 1], and if we are given a sample of size $n \geqq 2$, the only nontrivial nonrandomized Bayes rule is to reject $H_0$ if and only if two or more of the observations are exactly equal. This is really a test of the hypothesis that a distribution is continuous against the hypothesis that it is discrete. Thus, there is still a need for a prior that chooses a continuous distribution with probability one and yet satisfies properties (I) and (II). Some applications in which the possible doubling up of the values of the observations plays no essential role are presented in Section 5. These include the estimation of a distribution function, of a mean, of quantiles, of a variance and of a covariance. A two-sample problem is considered in which the Mann-Whitney statistic, equivalent to the rank-sum statistic, appears naturally. A decision theoretic upper tolerance limit for a quantile is also treated. Finally, a hypothesis testing problem concerning a quantile is shown to yield the sign test. In each of these problems, useful ways of combining prior information with the statistical observations appear. Other applications exist. In his Ph. D. dissertation [1], Charles Antoniak finds a need to consider mixtures of Dirichlet processes. He treats several problems, including the estimation of a mixing distribution, bio-assay, empirical Bayes problems, and discrimination problems.
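One convenient constructive view of a DP draw, the truncated stick-breaking representation (a later construction, not the definition used in the paper), makes the almost-sure discreteness discussed above easy to see in simulation.

```python
# Sketch: a truncated stick-breaking draw from a Dirichlet process DP(alpha, G0),
# illustrating that realisations are discrete with probability one. The truncation
# level and standard-normal base measure G0 are arbitrary choices.
import numpy as np

def dp_draw(alpha, g0_sampler, trunc=500, seed=0):
    rng = np.random.default_rng(seed)
    betas = rng.beta(1.0, alpha, size=trunc)            # stick-breaking proportions
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas[:-1])])
    weights = betas * remaining                          # atom weights (sum to ~1 after truncation)
    atoms = g0_sampler(rng, trunc)                       # atom locations drawn i.i.d. from G0
    return atoms, weights

atoms, w = dp_draw(alpha=2.0, g0_sampler=lambda rng, n: rng.normal(0, 1, n))
sample = np.random.choice(atoms, size=20, p=w / w.sum())
print(np.round(sample, 3))    # repeated values reflect the discreteness of the draw
```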
 Summary We propose a method (the ‘gap statistic’) for estimating the number of clusters (groups) in a set of data. The technique uses the output of any clustering algorithm (e.g. K-means or hierarchical), comparing the change in within-cluster dispersion with that expected under an appropriate reference null distribution. Some theory is developed for the proposal and a simulation study shows that the gap statistic usually outperforms other methods that have been proposed in the literature.
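A compact sketch of the gap computation with K-means and a uniform reference over the data's bounding box; the paper's reference-distribution options and its one-standard-error selection rule are simplified here to a plain argmax.

```python
# Sketch of the gap statistic: compare log within-cluster dispersion with its
# expectation under a uniform reference distribution over the data's bounding box.
import numpy as np
from sklearn.cluster import KMeans

def within_dispersion(X, k):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    return km.inertia_                          # sum of squared distances to centroids

def gap_statistic(X, k_max=6, n_ref=10, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps = []
    for k in range(1, k_max + 1):
        ref = [np.log(within_dispersion(rng.uniform(lo, hi, X.shape), k)) for _ in range(n_ref)]
        gaps.append(np.mean(ref) - np.log(within_dispersion(X, k)))
    return np.argmax(gaps) + 1, gaps            # simplified selection: largest gap

X = np.vstack([np.random.normal(0, 1, (100, 2)), np.random.normal(5, 1, (100, 2))])
print(gap_statistic(X))
```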
A Correction has been published for this article in Statistics in Medicine 2001; 20:655. The identification of changes in the recent trend is an important issue in the analysis of cancer mortality and incidence data. We apply a joinpoint regression model to describe such continuous changes and use the grid-search method to fit the regression function with unknown joinpoints assuming constant variance and uncorrelated errors. We find the number of significant joinpoints by performing several permutation tests, each of which has a correct significance level asymptotically. Each p-value is found using Monte Carlo methods, and the overall asymptotic significance level is maintained through a Bonferroni correction. These tests are extended to the situation with non-constant variance to handle rates with Poisson variation and possibly autocorrelated errors. The performance of these tests is studied via simulations and the tests are applied to U.S. prostate cancer incidence and mortality rates.
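A toy sketch of the grid-search step for a single joinpoint under constant variance; the permutation tests, Bonferroni correction, and Poisson-variance extensions described above are not reproduced, and the candidate grid is simply the observed x values.

```python
# Toy sketch: grid search for one joinpoint in a continuous piecewise-linear trend.
import numpy as np

def fit_one_joinpoint(x, y):
    best = None
    for tau in x[2:-2]:                               # candidate joinpoints on the grid of x values
        # design matrix: intercept, slope, and a hinge term that keeps the fit continuous
        X = np.column_stack([np.ones_like(x), x, np.clip(x - tau, 0, None)])
        beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
        rss = ((y - X @ beta) ** 2).sum()
        if best is None or rss < best[0]:
            best = (rss, tau, beta)
    return best

x = np.arange(1980, 2011, dtype=float)
y = np.where(x < 1995, 2.0 + 0.3 * (x - 1980), 6.5 - 0.2 * (x - 1995)) \
    + np.random.normal(0, 0.2, x.size)
print(fit_one_joinpoint(x, y)[1])     # estimated joinpoint year
```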
Abstract This article reviews Markov chain methods for sampling from the posterior distribution of a Dirichlet process mixture model and presents two new classes of methods. One new approach is to make Metropolis-Hastings updates of the indicators specifying which mixture component is associated with each observation, perhaps supplemented with a partial form of Gibbs sampling. The other new approach extends Gibbs sampling for these indicators by using a set of auxiliary parameters. These methods are simple to implement and are more efficient than previous ways of handling general Dirichlet process mixture models with non-conjugate priors.
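For orientation, here is a small Gibbs sampler for the component indicators of a conjugate normal-normal DP mixture. This corresponds to the standard conjugate sampler the article reviews, not its new Metropolis-Hastings or auxiliary-parameter methods, and the base-measure and variance parameters are arbitrary.

```python
# Sketch: Gibbs sampling the indicators of a Dirichlet process mixture with a
# conjugate normal-normal model (known observation variance sigma, N(0, tau^2) base measure).
import numpy as np
from scipy.stats import norm

def gibbs_dpmm(y, alpha=1.0, sigma=1.0, tau=3.0, n_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    z = np.zeros(len(y), dtype=int)                      # all observations start in one cluster
    for _ in range(n_iter):
        for i in range(len(y)):
            z[i] = -1                                    # remove y[i] from its cluster
            labels, counts = np.unique(z[z >= 0], return_counts=True)
            probs = []
            for lab, cnt in zip(labels, counts):
                members = y[z == lab]
                # predictive density of y[i] given the cluster's members
                post_var = 1.0 / (1.0 / tau**2 + len(members) / sigma**2)
                post_mean = post_var * members.sum() / sigma**2
                probs.append(cnt * norm.pdf(y[i], post_mean, np.sqrt(post_var + sigma**2)))
            probs.append(alpha * norm.pdf(y[i], 0.0, np.sqrt(tau**2 + sigma**2)))   # new cluster
            probs = np.array(probs) / np.sum(probs)
            choice = rng.choice(len(probs), p=probs)
            z[i] = labels[choice] if choice < len(labels) else \
                   (labels.max() + 1 if len(labels) else 0)
    return z

y = np.concatenate([np.random.normal(-4, 1, 60), np.random.normal(4, 1, 60)])
print(np.unique(gibbs_dpmm(y), return_counts=True))
```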
Abstract: The classification maximum likelihood approach is sufficiently general to encompass many current clustering algorithms, including those based on the sum of squares criterion and on the criterion of Friedman and Rubin (1967). However, as currently implemented, it does not allow the specification of which features (orientation, size and shape) are to be common to all clusters and which may differ between clusters. Also, it is restricted to Gaussian distributions and it does not allow for noise. We propose ways of overcoming these limitations. A reparameterization of the covariance matrix allows us to specify that some features, but not all, be the same for all clusters. A practical framework for non-Gaussian clustering is outlined, and a means of incorporating noise in the form of a Poisson process is described. An approximate Bayesian method for choosing the number of clusters is given. The performance of the proposed methods is studied by simulation, with encouraging results. The methods are applied to the analysis of a data set arising in the study of diabetes, and the results seem better than those of previous analyses.
 Abstract We describe and illustrate Bayesian inference in models for density estimation using mixtures of Dirichlet processes. These models provide natural settings for density estimation and are exemplified by special cases where data are modeled as a sample from mixtures of normal distributions. Efficient simulation methods are used to approximate various prior, posterior, and predictive distributions. This allows for direct inference on a variety of practical issues, including problems of local versus global smoothing, uncertainty about density estimates, assessment of modality, and the inference on the numbers of components. Also, convergence results are established for a general class of normal mixture models.
 Abstract We propose a new method for approximate Bayesian statistical inference on the basis of summary statistics. The method is suited to complex problems that arise in population genetics, extending ideas developed in this setting by earlier authors. Properties of the posterior distribution of a parameter, such as its mean or density curve, are approximated without explicit likelihood calculations. This is achieved by fitting a local-linear regression of simulated parameter values on simulated summary statistics, and then substituting the observed summary statistics into the regression equation. The method combines many of the advantages of Bayesian statistical inference with the computational efficiency of methods based on summary statistics. A key advantage of the method is that the nuisance parameters are automatically integrated out in the simulation step, so that the large numbers of nuisance parameters that arise in population genetics problems can be handled without difficulty. Simulation results indicate computational and statistical efficiency that compares favorably with those of alternative methods previously proposed in the literature. We also compare the relative efficiency of inferences obtained using methods based on summary statistics with those obtained directly from the data using MCMC.
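A toy sketch of the two ingredients, rejection on a summary statistic followed by a local-linear regression adjustment of the accepted draws, for a deliberately simple normal-mean problem; the prior, tolerance, and single summary statistic are my choices and not the paper's population-genetics setting.

```python
# Sketch: ABC rejection plus regression adjustment for estimating a normal mean
# from the sample mean of n = 30 observations as the only summary statistic.
import numpy as np

def abc_regression(obs_stat, n_sim=20000, accept_frac=0.02, seed=0):
    rng = np.random.default_rng(seed)
    theta = rng.uniform(-10, 10, n_sim)                  # draws from a flat prior (assumption)
    stats = rng.normal(theta, 1.0 / np.sqrt(30))         # simulated summary statistic
    dist = np.abs(stats - obs_stat)
    keep = dist <= np.quantile(dist, accept_frac)        # rejection step
    t, s = theta[keep], stats[keep]
    slope = np.polyfit(s, t, 1)[0]                       # local-linear regression of theta on the statistic
    adjusted = t + slope * (obs_stat - s)                # project accepted draws to the observed statistic
    return adjusted

post = abc_regression(obs_stat=1.3)
print(post.mean(), post.std())
```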
Abstract: Given a sequence of independent identically distributed random variables with a common probability density function, the problem of the estimation of a probability density function and of determining the mode of a probability function is discussed. Only estimates which are consistent and asymptotically normal are constructed.
This tutorial provides an overview of the basic theory of hidden Markov models (HMMs) as originated by L.E. Baum and T. Petrie (1966) and gives practical details on methods of implementation of the theory along with a description of selected applications of the theory to distinct problems in speech recognition. Results from a number of original sources are combined to provide a single source of acquiring the background required to pursue further this area of research. The author first reviews the theory of discrete Markov chains and shows how the concept of hidden states, where the observation is a probabilistic function of the state, can be used effectively. The theory is illustrated with two simple examples, namely coin-tossing, and the classic balls-in-urns system. Three fundamental problems of HMMs are noted and several practical techniques for solving these problems are given. The various types of HMMs that have been studied, including ergodic as well as left-right models, are described.
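As a small companion to the evaluation problem among the three fundamental problems mentioned above, here is a sketch of the forward recursion for a discrete-observation HMM; the two-state example parameters are arbitrary.

```python
# Sketch: the forward algorithm, i.e. computing P(observation sequence | model).
import numpy as np

def forward(pi, A, B, obs):
    """pi: initial state probs (N,), A: transitions (N, N), B: emissions (N, M), obs: symbol indices."""
    alpha = pi * B[:, obs[0]]                  # initialisation
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]          # induction step
    return alpha.sum()                         # likelihood of the whole sequence

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.5], [0.1, 0.9]])         # two states, two output symbols
print(forward(pi, A, B, obs=[0, 1, 1, 0]))
```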
The finite mixture (FM) model is the most commonly used model for statistical segmentation of brain magnetic resonance (MR) images because of its simple mathematical form and the piecewise constant nature of ideal brain MR images. However, being a histogram-based model, the FM has an intrinsic limitation--no spatial information is taken into account. This causes the FM model to work only on well-defined images with low levels of noise; unfortunately, this is often not the case due to artifacts such as partial volume effect and bias field distortion. Under these conditions, FM model-based methods produce unreliable results. In this paper, we propose a novel hidden Markov random field (HMRF) model, which is a stochastic process generated by an MRF whose state sequence cannot be observed directly but which can be indirectly estimated through observations. Mathematically, it can be shown that the FM model is a degenerate version of the HMRF model. The advantage of the HMRF model derives from the way in which the spatial information is encoded through the mutual influences of neighboring sites. Although MRF modeling has been employed in MR image segmentation by other researchers, most reported methods are limited to using MRF as a general prior in an FM model-based approach. To fit the HMRF model, an EM algorithm is used. We show that by incorporating both the HMRF model and the EM algorithm into an HMRF-EM framework, an accurate and robust segmentation can be achieved. More importantly, the HMRF-EM framework can easily be combined with other techniques. As an example, we show how the bias field correction algorithm of Guillemaud and Brady (1997) can be incorporated into this framework to achieve a three-dimensional fully automated approach for brain MR image segmentation.
 The Gibbs sampler, the algorithm of Metropolis and similar iterative simulation methods are potentially very helpful for summarizing multivariate distributions. Used naively, however, iterative simulation can give misleading answers. Our methods are simple and generally applicable to the output of any iterative simulation; they are designed for researchers primarily interested in the science underlying the data and models they are analyzing, rather than for researchers interested in the probability theory underlying the iterative simulations themselves. Our recommended strategy is to use several independent sequences, with starting points sampled from an overdispersed distribution. At each step of the iterative simulation, we obtain, for each univariate estimand of interest, a distributional estimate and an estimate of how much sharper the distributional estimate might become if the simulations were continued indefinitely. Because our focus is on applied inference for Bayesian posterior distributions in real problems, which often tend toward normality after transformations and marginalization, we derive our results as normal-theory approximations to exact Bayesian inference, conditional on the observed simulations. The methods are illustrated on a random-effects mixture model applied to experimental measurements of reaction times of normal and schizophrenic patients.
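A simplified version of the between/within-sequence comparison: the sketch computes a potential scale reduction factor for one estimand from several chains, omitting the transformations, chain splitting, and degrees-of-freedom corrections of the full method.

```python
# Sketch: potential scale reduction factor from m chains of n draws each.
import numpy as np

def rhat(chains):
    """chains: array of shape (m, n), m sequences of n draws for one estimand."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)             # between-sequence variance
    W = chains.var(axis=1, ddof=1).mean()       # within-sequence variance
    var_plus = (n - 1) / n * W + B / n          # pooled variance estimate
    return np.sqrt(var_plus / W)                # values near 1 suggest approximate convergence

rng = np.random.default_rng(0)
chains = rng.normal(0, 1, (4, 1000)) + rng.normal(0, 0.05, (4, 1))   # four slightly offset chains
print(rhat(chains))
```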
Abstract We consider problems involving groups of data where each observation within a group is a draw from a mixture model and where it is desirable to share mixture components between groups. We assume that the number of mixture components is unknown a priori and is to be inferred from the data. In this setting it is natural to consider sets of Dirichlet processes, one for each group, where the well-known clustering property of the Dirichlet process provides a nonparametric prior for the number of mixture components within each group. Given our desire to tie the mixture models in the various groups, we consider a hierarchical model, specifically one in which the base measure for the child Dirichlet processes is itself distributed according to a Dirichlet process. Such a base measure being discrete, the child Dirichlet processes necessarily share atoms. Thus, as desired, the mixture models in the different groups necessarily share mixture components. We discuss representations of hierarchical Dirichlet processes in terms of a stick-breaking process, and a generalization of the Chinese restaurant process that we refer to as the “Chinese restaurant franchise.” We present Markov chain Monte Carlo algorithms for posterior inference in hierarchical Dirichlet process mixtures and describe applications to problems in information retrieval and text modeling. Key words: Clustering; Hierarchical model; Markov chain Monte Carlo; Mixture model; Nonparametric Bayesian statistics.
As the need grows for conceptualization, formalization, and abstraction in biology, so too does mathematics' relevance to the field (Fagerström et al. 1996). Mathematics is particularly important for analyzing and characterizing random variation of, for example, size and weight of individuals in populations, their sensitivity to chemicals, and time-to-event cases, such as the amount of time an individual needs to recover from illness. The frequency distribution of such data is a major factor determining the type of statistical analysis that can be validly carried out on any data set. Many widely used statistical methods, such as ANOVA (analysis of variance) and regression analysis, require that the data be normally distributed, but only rarely is the frequency distribution of data tested when these techniques are used. The Gaussian (normal) distribution is most often assumed to describe the random variation that occurs in the data from many scientific disciplines; the well-known bell-shaped curve can easily be characterized and described by two values: the arithmetic mean x̄ and the standard deviation s, so that data sets are commonly described by the expression x̄ ± s. A historical example of a normal distribution is that of chest measurements of Scottish soldiers made by Quetelet, Belgian founder of modern social statistics (Swoboda 1974). In addition, such disparate phenomena as milk production by cows and random deviations from target values in industrial processes fit a normal distribution. However, many measurements show a more or less skewed distribution. Skewed distributions are particularly common when mean values are low, variances large, and values cannot be negative, as is the case, for example, with species abundance, lengths of latent periods of infectious diseases, and distribution of mineral resources in the Earth's crust. Such skewed distributions often closely fit the log-normal distribution (Aitchison and Brown 1957, Crow and Shimizu 1988, Lee 1992, Johnson et al. 1994, Sachs 1997). Examples fitting the normal distribution, which is symmetrical, and the log-normal distribution, which is skewed, are given in Figure 1. Note that body height fits both distributions. Often, biological mechanisms induce log-normal distributions (Koch 1966), as when, for instance, exponential growth is combined with further symmetrical variation: with a mean concentration of, say, 10^6 bacteria, one cell division more, or less, will lead to 2 × 10^6, or 5 × 10^5, cells. Thus, the range will be asymmetrical; to be precise, multiplied or divided by 2 around the mean. The skewed size distribution may be why "exceptionally" big fruit are reported in journals year after year in autumn. Such exceptions, however, may well be the rule: inheritance of fruit and flower size has long been known to fit the log-normal distribution (Groth 1914, Powers 1936, Sinnot 1937). What is the difference between normal and log-normal variability? Both forms of variability are based on a variety of forces acting independently of one another. A major difference, however, is that the effects can be additive or multiplicative, thus leading to normal or log-normal distributions, respectively.
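A tiny simulation can illustrate the additive-versus-multiplicative point above: summing many small independent effects gives a roughly symmetric, normal-looking distribution, while multiplying them gives a right-skewed, log-normal-looking one. This is purely a toy illustration, not taken from the article.

```python
# Toy illustration: additive effects -> approximately normal; multiplicative -> log-normal-like.
import numpy as np

rng = np.random.default_rng(0)
effects = rng.uniform(0.9, 1.1, size=(10000, 50))   # 50 small independent effects per unit

additive = effects.sum(axis=1)
multiplicative = effects.prod(axis=1)

def skew(v):
    # simple moment-based skewness
    return float(((v - v.mean()) ** 3).mean() / v.std() ** 3)

print("additive skewness:", round(skew(additive), 3))              # close to 0 (symmetric)
print("multiplicative skewness:", round(skew(multiplicative), 3))  # positive (right-skewed)
```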
Journal Article: Rank Correlation Methods. By Maurice G. Kendall. (London: Charles Griffin, 1948. Pp. vi + 160. 18s.) Reviewed by R. C. Geary, Dublin. The Economic Journal, Volume 59, Issue 236, 1 December 1949, Pages 575–577. https://doi.org/10.2307/2226580
19. Statistical Analysis with Missing Data. By R. J. A. Little and D. B. Rubin. ISBN 0 471 80254 9. Wiley, Chichester, 1987. xiv + 278 pp. £32.05.
Preface. 1. Preliminary Information. 2. Families of Discrete Distributions. 3. Binomial Distributions. 4. Poisson Distributions. 5. Negative Binomial Distributions. 6. Hypergeometric Distributions. 7. Logarithmic and Lagrangian Distributions. 8. Mixture Distributions. 9. Stopped-Sum Distributions. 10. Matching, Occupancy, Runs, and q-Series Distributions. 11. Parametric Regression Models and Miscellanea. Bibliography. Abbreviations. Index.
 Finite mixture models are being used increasingly to model a wide variety of random phenomena for clustering, classification and density estimation. mclust is a powerful and popular package which allows modelling of data as a Gaussian finite mixture with different covariance structures and different numbers of mixture components, for a variety of purposes of analysis. Recently, version 5 of the package has been made available on CRAN. This updated version adds new covariance structures, dimension reduction capabilities for visualisation, model selection criteria, initialisation strategies for the EM algorithm, and bootstrap-based inference, making it a full-featured R package for data analysis via finite mixture modelling.
 An $s$ stage $k$ name snowball sampling procedure is defined as follows: A random sample of individuals is drawn from a given finite population. (The kind of random sample will be discussed later in this section.) Each individual in the sample is asked to name $k$ different individuals in the population, where $k$ is a specified integer; for example, each individual may be asked to name his "$k$ best friends," or the "$k$ individuals with whom he most frequently associates," or the "$k$ individuals whose opinions he most frequently seeks," etc. (For the sake of simplicity, we assume throughout that an individual cannot include himself in his list of $k$ individuals.) The individuals who were not in the random sample but were named by individuals in it form the first stage. Each of the individuals in the first stage is then asked to name $k$ different individuals. (We assume that the question asked of the individuals in the random sample and of those in each stage is the same and that $k$ is the same.) The individuals who were not in the random sample nor in the first stage but were named by individuals who were in the first stage form the second stage. Each of the individuals in the second stage is then asked to name $k$ different individuals. The individuals who were not in the random sample nor in the first or second stages but were named by individuals who were in the second stage form the third stage. Each of the individuals in the third stage is then asked to name $k$ different individuals. This procedure is continued until each of the individuals in the $s$th stage has been asked to name $k$ different individuals. The data obtained using an $s$ stage $k$ name snowball sampling procedure can be utilized to make statistical inferences about various aspects of the relationships present in the population. The relationships present, in the hypothetical situation where each individual in the population is asked to name $k$ different individuals, can be described by a matrix with rows and columns corresponding to the members of the population, rows for the individuals naming and columns for the individuals named, where the entry $\theta_{ij}$ in the $i$th row and $j$th column is 1 if the $i$th individual in the population includes the $j$th individual among the $k$ individuals he would name, and it is 0 otherwise. While the matrix of the $\theta$'s cannot be known in general unless every individual in the population is interviewed (i.e., asked to name $k$ different individuals), it will be possible to make statistical inferences about various aspects of this matrix from the data obtained using an $s$ stage $k$ name snowball sampling procedure. For example, when $s = k = 1$, the number, $M_{11}$, of mutual relationships present in the population (i.e., the number of values $i$ with $\theta_{ij} = \theta_{ji} = 1$ for some value of $j > i$) can be estimated. The methods of statistical inference applied to the data obtained from an $s$ stage $k$ name snowball sample will of course depend on the kind of random sample drawn as the initial step. In most of the present paper, we shall suppose that a random sample (i.e., the "zero stage" in snowball sample) is drawn so that the probability, $p$, that a given individual in the population will be in the sample is independent of whether a different given individual has appeared. This kind of sampling has been called binomial sampling; the specified value of $p$ (assumed known) has been called the sampling fraction [4]. 
This sampling scheme might also be described by saying that a given individual is included in the sample just when a coin, which has a probability $p$ of "heads," comes up "heads," where the tosses of the coin from individual to individual are independent. (To each individual there corresponds an independent Bernoulli trial determining whether he will or will not be included in the sample.) This sampling scheme differs in some respects from the more usual models where the sample size is fixed in advance or where the ratio of the sample size to the population size (i.e., the sample size-population size ratio) is fixed. For binomial sampling, this ratio is a random variable whose expected value is $p$. (The variance of this ratio approaches zero as the population becomes infinite.) In some situations (where, for example, the variance of this ratio is near zero), mathematical results obtained for binomial sampling are sometimes quite similar to results obtained using some of the more usual sampling models (see [4], [7]; compare the variance formulas in [3] and [5]); in such cases it will often not make much difference, from a practical point of view, which sampling model is utilized. (In Section 6 of the present paper some results for snowball sampling based on an initial sample of the more usual kind are obtained and compared with results presented in the earlier sections of this paper obtained for snowball sampling based on an initial binomial sample.) For snowball sampling based on an initial binomial sample, and with $s = k = 1$, so that each individual asked names just one other individual and there is just one stage beyond the initial sample, Section 2 of this paper discusses unbiased estimation of $M_{11}$, the number of pairs of individuals in the population who would name each other. One of the unbiased estimators considered (among a certain specified class of estimators) has uniformly smallest variance when the population characteristics are unknown; this one is based on a sufficient statistic for a simplified summary of the data and is the only unbiased estimator of $M_{11}$ based on that sufficient statistic (when the population characteristics are unknown). This estimator (when $s = k = 1$) has a smaller variance than a comparable minimum variance unbiased estimator computed from a larger random sample when $s = 0$ and $k = 1$ (i.e., where only the individuals in the random sample are interviewed) even where the expected number of individuals in the larger random sample $(s = 0, k = 1)$ is equal to the maximum expected number of individuals studied when $s = k = 1$ (i.e., the sum of the expected number of individuals in the initial sample and the maximum expected number of individuals in the first stage). In fact, the variance of the estimator when $s = 0$ and $k = 1$ is at least twice as large as the variance of the comparable estimator when $s = k = 1$ even where the expected number of individuals studied when $s = 0$ and $k = 1$ is as large as the maximum expected number of individuals studied when $s = k = 1$. Thus, for estimating $M_{11}$, the sampling scheme with $s = k = 1$ is preferable to the sampling scheme with $s = 0$ and $k = 1$. 
Furthermore, we observe that when $s = k = 1$ the unbiased estimator based on the simplified summary of the data having minimum variance when the population characteristics are unknown can be improved upon in cases where certain population characteristics are known, or where additional data not included in the simplified summary are available. Several improved estimators are derived and discussed. Some of the results for the special case of $s = k = 1$ are generalized in Sections 3 and 4 to deal with cases where $s$ and $k$ are any specified positive integers. In Section 5, results are presented about $s$ stage $k$ name snowball sampling procedures, where each individual asked to name $k$ different individuals chooses $k$ individuals at random from the population. (Except in Section 5, the numbers $\theta_{ij}$, which form the matrix referred to earlier, are assumed to be fixed (i.e., to be population parameters); in Section 5, they are random variables. A variable response error is not considered except in so far as Section 5 deals with an extreme case of this.) For social science literature that discusses problems related to snowball sampling, see [2], [8], and the articles they cite. This literature indicates, among other things, the importance of studying "social structure and...the relations among individuals" [2].
Journal Article: Partial likelihood. D. R. Cox, Department of Mathematics, Imperial College, London. Biometrika, Volume 62, Issue 2, August 1975, Pages 269–276. https://doi.org/10.1093/biomet/62.2.269
Abstract Intensification of the water cycle due to climate change has a significant impact on human societies and ecosystems. Globally, 77% of precipitation and 85% of evaporation occur over the oceans. Precipitation, river run-off and ice melt dilute and evaporation and ice formation concentrates salinity in sea water, making ocean salinity a crucial indicator for quantifying water cycle change. However, ocean salinity is relatively under-observed, compared to ocean temperature, making long-term variations in salinity challenging to quantify. This study focuses on the development of a method to create 2-dimensional maps of ocean salinity and its trends on pressure surfaces from sparse observations. To achieve this, we employ an unsupervised classification technique called Gaussian Mixture Modeling (GMM). GMM is used to identify coherent regions where temperature and salinity are tightly related at constant pressure. Within each region, sparse observations of salinity are aggregated and means and trends quantified. The method is assessed using temperature and salinity data from the climate model ACCESS-CM2, sub-sampled to mimic historical observational programs. We demonstrate that the method has skill in predicting historical salinity trends from 1970 to 2014. In the South Atlantic, at a pressure of 539 dbar, the root mean square errors of salinity and of the salinity linear trend are 0.040 g kg^-1 and 2.1 × 10^-3 g kg^-1 yr^-1. Applying this methodology at all depth levels and across all oceans could help fill in missing salinity observations and thus improve our understanding of the intensification of the global water cycle in response to climate change.
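A toy sketch of the classify-then-aggregate idea described above: fit a Gaussian mixture to temperature-salinity pairs and summarise the mean and linear trend of salinity within each class. The data are synthetic; this is not the study's ACCESS-CM2 configuration or its observational sub-sampling.

```python
# Toy sketch: GMM classification of temperature-salinity pairs, then per-class
# salinity means and trends (synthetic data, illustrative only).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
temp = np.concatenate([rng.normal(5, 0.5, 400), rng.normal(15, 1.0, 400)])
salt = np.concatenate([rng.normal(34.2, 0.05, 400), rng.normal(35.5, 0.1, 400)])
year = rng.integers(1970, 2015, size=800)
salt = salt + 0.001 * (year - 1970)                      # impose a small trend

X = np.column_stack([temp, salt])
labels = GaussianMixture(n_components=2, random_state=0).fit_predict(X)

for k in range(2):
    sel = labels == k
    trend = np.polyfit(year[sel], salt[sel], 1)[0]       # g/kg per year within the class
    print(f"class {k}: mean salinity {salt[sel].mean():.3f} g/kg, trend {trend:.2e} g/kg/yr")
```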
Ya Su | Australian & New Zealand Journal of Statistics
ABSTRACT Datasets for statistical analysis have become extremely large and can be stored on a single machine only with some difficulty. Even when the data can be stored in one machine, the computational cost would still be intimidating. We propose a divide and conquer solution to density estimation using Bayesian mixture modelling, including the infinite mixture case. The methodology can be generalised to other application problems where a Bayesian mixture model is adopted. The proposed prior on each machine or subgroup modifies the original prior on both mixing probabilities and the rest of parameters in the distributions being mixed. The ultimate estimator is obtained by taking the average of the posterior samples corresponding to the proposed prior on each subset. Despite the tremendous reduction in time thanks to data splitting, the posterior contraction rate of the proposed estimator stays the same (up to a factor) as that using the original prior when the data is analysed as a whole. Simulation studies also justify the competency of the proposed method compared to the established WASP estimator in the finite-dimensional case. In addition, one of our simulations is performed in a shape-constrained deconvolution context and reveals promising results. The application to a GWAS dataset reveals the advantage over a naive divide and conquer method that uses the original prior.
 This article studies the robustness of quasi-maximum-likelihood estimation in hidden Markov models when the regime-switching structure is misspecified. Specifically, we examine the case where the data-generating process features a hidden Markov regime sequence with covariate-dependent transition probabilities, but estimation proceeds under a simplified mixture model that assumes regimes are independent and identically distributed. We show that the parameters governing the conditional distribution of the observables can still be consistently estimated under this misspecification, provided certain regularity conditions hold. Our results highlight a practical benefit of using computationally simpler mixture models in settings where regime dependence is complex or difficult to model directly.
 Abstract The ever-increasing availability of high-throughput DNA sequences and the development of numerous computational methods have led to considerable advances in our understanding of the evolutionary and demographic history of populations. Several demographic inference methods have been developed to take advantage of these massive genomic data. Simulation-based approaches, such as approximate Bayesian computation (ABC), have proved particularly efficient for complex demographic models. However, taking full advantage of the comprehensive information contained in massive genomic data remains a challenge for demographic inference methods, which generally rely on partial information from these data. Using advanced computational methods, such as machine learning, is valuable for efficiently integrating more comprehensive information. Here, we showed how simulation-based supervised machine learning methods applied to an extensive range of summary statistics are effective in inferring demographic parameters for connected populations. We compared three machine learning (ML) methods: a neural network, the multilayer perceptron (MLP), and two ensemble methods, random forest (RF) and the gradient boosting system XGBoost (XGB), to infer demographic parameters from genomic data under a standard isolation with migration model and a secondary contact model with varying population sizes. We showed that MLP outperformed the other two methods and that, on the basis of permutation feature importance, its predictions involved a larger combination of summary statistics. Moreover, they outperformed all three tested ABC algorithms. Finally, we demonstrated how a method called SHAP, from the field of explainable artificial intelligence, can be used to shed light on the contribution of summary statistics within the ML models.
 ABSTRACT Closed‐form expressions for the score vector and the Hessian matrix of the log‐likelihood function are derived for mixtures of matrix‐variate normal distributions. These results are obtained by exploiting properties of the trace operator and the Kronecker product, enabling fast and reliable computation of standard errors and eliminating the need for costly numerical differentiation. The advantages of the approach are highlighted through a comprehensive simulation study based on synthetic data under different scenarios.
 ABSTRACT Variational autoencoders (VAEs) with symmetric mixture priors have shown exceptional performance in clustering applications. However, these methods often struggle with asymmetrical distributions and extreme values. Many fields, such as medical imaging, bioinformatics, finance, and atmospheric science, frequently deal with right‐skewed data. To address this, we propose a novel VAE clustering approach that utilizes a gamma mixture prior, effectively accommodating right‐skewed distributions. Our findings highlight the significance of selecting an appropriate distribution for the data. The proposed method demonstrates the ability to achieve a more parsimonious model with fewer parameters compared to other deep learning clustering techniques, which is essential for clustering high‐dimensional data with small sample sizes. We evaluate our approach on both synthetic and real right‐skewed datasets, demonstrating superior clustering performance compared to existing deep generative clustering methods and traditional statistical models.