$I$-Divergence Geometry of Probability Distributions and Minimization Problems

Authors

I. Csiszár

Type: Article
Publication Date: 1975-02-01
Citations: 1544
DOI: https://doi.org/10.1214/aop/1176996454

Abstract

Some geometric properties of probability distributions (PD's) are established, Kullback's $I$-divergence playing the role of squared Euclidean distance. The minimum discrimination information problem is viewed as that of projecting a PD onto a convex set of PD's, and useful existence theorems for and characterizations of the minimizing PD are arrived at. A natural generalization of known iterative algorithms converging to the minimizing PD in special situations is given; even for those special cases, our convergence proof is more generally valid than those previously published. As corollaries of independent interest, generalizations of known results on the existence of PD's or nonnegative matrices of a certain form are obtained. The Lagrange multiplier technique is not used.
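As a purely illustrative companion to the abstract (not code from the paper), the sketch below computes the I-divergence of two probability vectors and the I-projection of a reference vector r onto a convex set cut out by one linear moment constraint, using the well-known exponential-tilting form of the minimizer. The function names, bisection bounds and the die example are our own choices.

```python
import numpy as np

def i_divergence(p, q):
    """Kullback's I-divergence D(p || q) = sum_i p_i * log(p_i / q_i)."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def i_projection_single_constraint(r, a, b, lo=-50.0, hi=50.0):
    """I-projection of r onto {p : sum_i p_i * a_i = b}, with b assumed feasible.
    The minimizer has the exponential-tilting form p_i proportional to r_i * exp(theta * a_i);
    theta is found by bisection because theta -> E_{p(theta)}[a] is monotone."""
    def tilt(theta):
        w = r * np.exp(theta * (a - a.mean()))   # centred for numerical stability
        return w / w.sum()
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if tilt(mid) @ a < b:
            lo = mid                              # need a heavier right tail
        else:
            hi = mid
    return tilt(0.5 * (lo + hi))

# Toy example: tilt a fair die so that its mean becomes 4.5.
r = np.full(6, 1 / 6)
a = np.arange(1, 7, dtype=float)
p_star = i_projection_single_constraint(r, a, b=4.5)
print(np.round(p_star, 4))
print("mean:", p_star @ a, " I-divergence from uniform:", i_divergence(p_star, r))
```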

Locations

  • The Annals of Probability

Related Paper Abstracts

The problem of finding the probability vector lying within a particular set ε which minimizes the I-divergence between it and a specified probability vector r (the I-projection of r onto ε) is often a difficult one. Here we establish a duality relationship between this problem and one of minimizing a linear combination of exponential functions. The latter problem is often much easier to solve. This fact, coupled with a recent algorithm of Dykstra, provides a new method of solving difficult I-projection problems. It is also shown that when ε is a cone of isotonic vectors, the I-projection can be precisely expressed as a certain function of least squares equal weights projections. Keywords: I-divergence; I-projections; convexity; Kullback–Leibler information number; cross entropy; Fenchel duality theorem; isotonic cones; least squares projections. (This work was partially supported by ONR contract N000114-83-K-0249.)
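The flavour of such a duality can be checked numerically in a simplified setting of our own choosing (equality moment constraints rather than the paper's isotonic cones): minimizing the positive linear combination of exponentials G(lambda) = sum_i r_i * exp(lambda . (a_i - b)) and tilting r by the minimizer recovers the I-projection, and -log(min G) equals the minimal I-divergence. The sketch below, with invented data and scipy's BFGS optimizer, verifies that identity; it is not the paper's algorithm.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d = 8, 2
r = rng.dirichlet(np.ones(n))        # reference probability vector
A = rng.normal(size=(n, d))          # column j holds the values of moment function a_j
b = (0.7 * r + 0.3 / n) @ A          # feasible moment target (moments of a full-support mixture)

def G(lam):
    """Dual objective: a positive linear combination of exponentials of lam."""
    return float(r @ np.exp((A - b) @ lam))

lam_star = minimize(G, x0=np.zeros(d), method="BFGS").x

# Recover the I-projection of r onto {p : p @ A = b} by exponential tilting.
w = r * np.exp(A @ lam_star)
p_star = w / w.sum()

primal = float(np.sum(p_star * np.log(p_star / r)))
print("constraint residual:", float(np.abs(p_star @ A - b).max()))
print("D(p*||r):", primal, "  -log(min G):", -np.log(G(lam_star)))
```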
A divergence measure between two probability distributions or positive arrays (positive measures) is a useful tool for solving optimization problems in signal processing, machine learning, and statistical inference. The Csiszár f-divergence is a unique class of divergences having information monotonicity, from which the dual alpha-geometrical structure with the Fisher metric is derived. The Bregman divergence is another class of divergences that gives a dually flat geometrical structure different from the alpha-structure in general. Csiszár gave an axiomatic characterization of divergences related to inference problems. The Kullback-Leibler divergence is proved to belong to both classes, and this is the only such one in the space of probability distributions. This paper proves that the alpha-divergences constitute a unique class belonging to both classes when the space of positive measures or positive arrays is considered. They are the canonical divergences derived from the dually flat geometrical structure of the space of positive measures.
Author(s): Saha, Sujayam | Advisor(s): Yu, Bin; Guntuboyina, Aditya | Abstract: This thesis documents three different contributions in statistical learning theory. They were developed with careful emphasis on addressing the demands of modern statistical analysis upon large-scale modern datasets. The contributions concern themselves with advancements in information theory, dimension reduction and density estimation, three foundational topics in statistical theory with a plethora of applications in both practical problems and development of other aspects of statistical methodology. In Chapter \ref{chapter:fdiv}, I describe the development of a unifying treatment of the study of inequalities between $f$-divergences, which are a general class of divergences between probability measures which include as special cases many commonly used divergences in probability, mathematical statistics and information theory, such as Kullback-Leibler divergence, chi-squared divergence, squared Hellinger distance, total variation distance, etc. In contrast with previous research in this area, we study the problem of obtaining sharp inequalities between $f$-divergences in full generality. In particular, our main results allow $m$ to be an arbitrary positive integer and all the divergences $D_f$ and $D_{f_1}, \dots, D_{f_m}$ to be arbitrary $f$-divergences. We show that the underlying optimization problems can be reduced to low-dimensional optimization problems and we outline methods for solving them. We also show that many of the existing results on inequalities between $f$-divergences can be obtained as special cases of our results, and we also improve on some existing non-sharp inequalities. In Chapter \ref{chapter:srp}, I describe the development of a new dimension reduction technique specially suited for interpretable inference in supervised learning problems involving large-dimensional data. This new technique, Supervised Random Projections (SRP), is introduced with the goal of ensuring that, in comparison to ordinary dimension reduction, the compressed data is more relevant to the response variable at hand in a supervised learning problem. By incorporating variable importances, we explicate that the compressed data should still accurately explain the response variable, thus lending more interpretability to the dimension reduction step. Further, variable importances ensure that even in the presence of numerous nuisance parameters, the projected data retains at least a moderate amount of information from the important variables, thus allowing said important variables a fair chance at being selected by downstream formal tests of hypotheses. In Chapter \ref{chapter:npmle}, I describe the development of several adaptivity properties of the Non-Parametric Maximum Likelihood Estimator (NPMLE) in the problem of estimating an unknown Gaussian location mixture density based on independent identically distributed observations. Further, I explore the role of the NPMLE in the problem of denoising normal means, i.e. the problem of estimating unknown means based on observations. This problem has been studied widely. In this problem, I prove that the Generalized Maximum Likelihood Empirical Bayes estimator (GMLEB) approximates the Oracle Bayes estimator at adaptive parametric rates up to additional logarithmic factors in expected squared $\ell_2$ norm.
In information theory, as well as in the adjacent fields of statistics, machine learning, artificial intelligence, signal processing and pattern recognition, many flexibilizations of the omnipresent Kullback-Leibler information distance (relative entropy) and of the closely related Shannon entropy have become frequently used tools. The main goal of this paper is to tackle corresponding constrained minimization (respectively maximization) problems by a newly developed dimension-free bare (pure) simulation method. Almost no assumptions (like convexity) on the set of constraints are needed, within our discrete setup of arbitrary dimension, and our method is precise (i.e., converges in the limit). As a side effect, we also derive an innovative way of constructing new useful distances/divergences. To illustrate the core of our approach, we present numerous solved cases. The potential for widespread applicability is indicated, too; in particular, we deliver many recent references for uses of the involved distances/divergences and entropies in various different research fields (which may also serve as an interdisciplinary interface).
This chapter presents intuitive understandings for statistical learning from an information geometric point of view. We discuss a wide class of information divergence indices that express quantitatively a departure between any two probability density functions. In general, the information divergence leads to a statistical method by minimization which is based on the empirical data available. We discuss the association between the information divergence and a Riemannian metric and a pair of conjugate linear connections for a family of probability density functions. The most familiar example is the Kullback-Leibler divergence, which leads to the maximum likelihood method associated with the information metric and the pair of the exponential and mixture connections. For the class of statistical methods obtained by minimizing the divergence we discuss statistical properties focusing on its robustness. As applications to statistical learning we discuss the minimum divergence method for the principal component analysis, independent component analysis and for statistical pattern recognition.
Information geometry is the fundamental and cutting-edge discipline that explores statistical problems on Riemannian manifolds of probability distributions using the methods of differential geometry. It has been identified as the second generation of modern information theory pioneered by Shannon, and exhibits great potential in developing the field of information science and systems theory. This paper begins defining the scientific content of information geometry from the intrinsic geometrical structures of parameterized families of probability distributions as well as geometric properties of information; it points out theoretical advantages and methodological innovations of information geometry compared with classical statistics and information theory. Next, the paper briefly introduces connections between information geometry and differential geometry, and elaborates the history of information geometry as well as its applications to various areas such as neural networks, statistical inference, communications and coding, systems and control theory, physics and medical imaging, among others. In particular, the applications of information geometry to signal processing, including the latest results of geometric methods of signal detection, nonlinear parameter estimation, and filtering, are comprehensively introduced. The basic ideas and methods of information geometry are also summarized. Finally, by viewing prospects for the development of information geometry, several open problems of information geometry in applications to signal processing are proposed.
This paper proposes two linear projection methods for supervised dimension reduction using only the first and second-order statistics. The methods, each catering to a different parameter regime, are derived under the general Gaussian model by maximizing the Kullback-Leibler divergence between the two classes in the projected sample for a binary classification problem. They subsume existing linear projection approaches developed under simplifying assumptions on the Gaussian distributions, such as the assumption that the distributions share an equal mean or covariance matrix. As a by-product, we establish that the multi-class linear discriminant analysis, a celebrated method for classification and supervised dimension reduction, is provably optimal for maximizing pairwise Kullback-Leibler divergence when the Gaussian populations share an identical covariance matrix. For the case when the Gaussian distributions share an equal mean, we establish conditions under which the optimal subspace remains invariant regardless of how the Kullback-Leibler divergence is defined, despite the asymmetry of the divergence measure itself. Such conditions encompass the classical case of signal plus noise, where both the signal and noise have zero mean and arbitrary covariance matrices. Experiments are conducted to validate the proposed solutions, demonstrate their superior performance over existing alternatives, and illustrate the procedure for selecting the appropriate linear projection solution.
The problem of estimating the Kullback-Leibler divergence $D(P\|Q)$ between two unknown distributions $P$ and $Q$ is studied, under the assumption that the alphabet size $k$ of the distributions can scale to infinity. The estimation is based on $m$ independent samples drawn from $P$ and $n$ independent samples drawn from $Q$. It is first shown that there does not exist any consistent estimator that guarantees asymptotically small worst-case quadratic risk over the set of all pairs of distributions. A restricted set that contains pairs of distributions with density ratio bounded by a function $f(k)$ is further considered. An augmented plug-in estimator is proposed, and its worst-case quadratic risk is shown to be within a constant factor of $\left(\frac{k}{m}+\frac{k f(k)}{n}\right)^2+\frac{\log^2 f(k)}{m}+\frac{f(k)}{n}$, if $m$ and $n$ exceed a constant factor of $k$ and $k f(k)$, respectively. Moreover, the minimax quadratic risk is characterized to be within a constant factor of $\left(\frac{k}{m\log k}+\frac{k f(k)}{n\log k}\right)^2+\frac{\log^2 f(k)}{m}+\frac{f(k)}{n}$, if $m$ and $n$ exceed a constant factor of $k/\log k$ and $k f(k)/\log k$, respectively. The lower bound on the minimax quadratic risk is characterized by employing a generalized Le Cam's method. A minimax optimal estimator is then constructed by employing both the polynomial approximation and the plug-in approaches.
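The minimax-optimal estimator in that paper relies on polynomial approximation; as a much cruder point of reference, here is a hedged sketch of a plain plug-in estimate of D(P||Q) over a finite alphabet, with an arbitrary add-constant smoothing of the Q counts so the ratio is always defined. The smoothing constant, alphabet size and sample sizes below are illustrative choices, not the paper's tuning.

```python
import numpy as np

def kl_plugin(x_samples, y_samples, k, c=0.5):
    """Naive plug-in estimate of D(P || Q) over the alphabet {0, ..., k-1}.
    Counts for Q are smoothed by a constant c so that q_hat is never zero."""
    m, n = len(x_samples), len(y_samples)
    p_hat = np.bincount(x_samples, minlength=k) / m
    q_hat = (np.bincount(y_samples, minlength=k) + c) / (n + c * k)
    mask = p_hat > 0
    return float(np.sum(p_hat[mask] * np.log(p_hat[mask] / q_hat[mask])))

rng = np.random.default_rng(1)
k = 100
P = rng.dirichlet(np.ones(k))
Q = rng.dirichlet(np.ones(k))
truth = float(np.sum(P * np.log(P / Q)))
est = kl_plugin(rng.choice(k, size=20000, p=P), rng.choice(k, size=20000, p=Q), k)
print("true D(P||Q):", truth, "  plug-in estimate:", est)
```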
We develop and analyze $M$-estimation methods for divergence functionals and the likelihood ratios of two probability distributions. Our method is based on a nonasymptotic variational characterization of $f$-divergences, which allows the problem of estimating divergences to be tackled via convex empirical risk optimization. The resulting estimators are simple to implement, requiring only the solution of standard convex programs. We present an analysis of consistency and convergence for these estimators. Given conditions only on the ratios of densities, we show that our estimators can achieve optimal minimax rates for the likelihood ratio and the divergence functionals in certain regimes. We derive an efficient optimization algorithm for computing our estimates, and illustrate their convergence behavior and practical viability by simulations.
For a lower semicontinuous convex function $f: {\bf R}\to{\bf R}\cup\{+\infty\}$, ${\rm dom}\, f\subseteq{\bf R}_+$, we give a definition and study properties of the $f$-divergence of finitely additive set functions $\mu$ and $\nu$ given on a measurable space $(\Omega,{\cal F})$. If $f$ is finite on $(0,+\infty)$ and $\mu$ and $\nu$ are probability measures, our definition is equivalent to the classical definition of the $f$-divergence introduced by Csiszár. As an application, we obtain a result on attaining the minimum by the $f$-divergence over a set ${\cal Z}$ of pairs of probability measures.
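For intuition, the finite-alphabet specialization of the classical f-divergence is easy to state and compute: D_f(mu || nu) = sum_i nu_i * f(mu_i / nu_i) for a convex f with f(1) = 0. The snippet below is our own minimal sketch of that formula (assuming strictly positive nu so no convention for zero denominators is needed), evaluated for a few standard choices of f.

```python
import numpy as np

def f_divergence(mu, nu, f):
    """Discrete f-divergence D_f(mu || nu) = sum_i nu_i * f(mu_i / nu_i),
    assuming all entries of nu are strictly positive."""
    mu, nu = np.asarray(mu, float), np.asarray(nu, float)
    return float(np.sum(nu * f(mu / nu)))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])
print(f_divergence(p, q, lambda t: t * np.log(t)))        # Kullback-Leibler divergence
print(f_divergence(p, q, lambda t: 0.5 * abs(t - 1)))     # total variation distance
print(f_divergence(p, q, lambda t: (np.sqrt(t) - 1)**2))  # squared Hellinger distance
```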
"No free lunch" results state the impossibility of obtaining meaningful bounds on the error of a learning algorithm without prior assumptions and modelling, which is more or less realistic for … "No free lunch" results state the impossibility of obtaining meaningful bounds on the error of a learning algorithm without prior assumptions and modelling, which is more or less realistic for a given problem. Some models are "expensive" (strong assumptions, such as sub-Gaussian tails), others are "cheap" (simply finite variance). As it is well known, the more you pay, the more you get: in other words, the most expensive models yield the more interesting bounds. Recent advances in robust statistics have investigated procedures to obtain tight bounds while keeping the cost of assumptions minimal. The present paper explores and exhibits what the limits are for obtaining tight probably approximately correct (PAC)-Bayes bounds in a robust setting for cheap models.
We pose the problem of the optimal approximation of a given nonnegative signal $y_t$ with the scalar autoconvolution $(x*x)_t$ of a nonnegative signal $x_t$, where $x_t$ and $y_t$ are signals of equal length. The $\mathcal{I}$-divergence has been adopted as optimality criterion, being well suited to incorporate nonnegativity constraints. To find a minimizer we derive an iterative descent algorithm of the alternating minimization type. The algorithm is based on the lifting of the original problem to a larger space, a relaxation technique developed by Csiszár and Tusnády in [Statistics & Decisions (S1) (1984), 205–237] which, in the present context, requires the solution of a hard partial minimization problem. We study the asymptotic behavior of the algorithm exploiting the optimality properties of the partial minimization problems and prove, among other results, that its limit points are Kuhn-Tucker points of the original minimization problem. Numerical experiments illustrate the results.
We propose a new approach to deriving quantitative mean field approximations for any probability measure $P$ on $\mathbb{R}^{n}$ with density proportional to $e^{f(x)}$, for $f$ strongly concave. We bound the mean field approximation for the log partition function $\log \int e^{f(x)}dx$ in terms of $\sum_{i \neq j}\mathbb{E}_{Q^{*}}|\partial_{ij}f|^{2}$, for a semi-explicit probability measure $Q^{*}$ characterized as the unique mean field optimizer, or equivalently as the minimizer of the relative entropy $H(\cdot \,|\,P)$ over product measures. This notably does not involve metric-entropy or gradient-complexity concepts which are common in prior work on nonlinear large deviations. Three implications are discussed, in the contexts of continuous Gibbs measures on large graphs, high-dimensional Bayesian linear regression, and the construction of decentralized near-optimizers in high-dimensional stochastic control problems. Our arguments are based primarily on functional inequalities and the notion of displacement convexity from optimal transport.
The information value decomposition approximates a positive-valued matrix by a sequence of reduced rank matrices. A rank K approximating matrix is closest to the original matrix in the sense of minimizing the discrimination between the original matrix and the approximation, over rank K matrices. The information value decomposition is analogous to the singular value decomposition with discrimination used for the discrepancy measure instead of squared error. Several properties of the information value decomposition correspond to properties of the singular value decomposition. These properties are discussed.
An important problem in statistics is the determination of a joint probability distribution from its marginal distributions. In the two-dimensional case, the marginal probability densities f1(x) and f2(y) are related to the joint density f(x,y) by integrals along the horizontal and vertical lines (the two axes x and y). Thus, the problem of determining f(x,y) from knowledge of f1(x) and f2(y) is an ill-posed inverse problem. In statistics, the notion of a copula is introduced to obtain a solution to this problem. A similar problem in X-ray tomography is the reconstruction of an image f(x,y), representing the distribution of the density of a quantity inside an object, from its two projections, horizontal and vertical, f1(x) and f2(y). There is also a large number of methods for such problems based on the Radon transform. In this article, we show the links between the notion of copula and that of X-ray tomography, and examine whether the methods of one domain can be used in the other.
We investigate the convergence rate of multi-marginal optimal transport costs that are regularized with the Boltzmann-Shannon entropy, as the noise parameter $\varepsilon$ tends to $0$. We establish lower and upper bounds on the difference with the unregularized cost of the form $C\varepsilon\log(1/\varepsilon)+O(\varepsilon)$ for some explicit dimensional constants $C$ depending on the marginals and on the ground cost, but not on the optimal transport plans themselves. Upper bounds are obtained for Lipschitz costs or locally semi-concave costs for a finer estimate, and lower bounds for $\mathscr{C}^2$ costs satisfying some signature condition on the mixed second derivatives that may include degenerate costs, thus generalizing results previously known in the two marginals case and for non-degenerate costs. We obtain in particular matching bounds in some typical situations where the optimal plan is deterministic.
Darroch and Ratcliff's iterative algorithm for minimizing $I$-divergence subject to linear constraints is equivalent to a cyclic iteration of explicitly performable $I$-projection operations.
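Below is a small, hedged sketch of generalized iterative scaling in the spirit of the Darroch and Ratcliff algorithm mentioned above, for nonnegative features whose values sum to one in each state (a slack feature can always be added to arrange this). The renormalization step, iteration count and toy data are our own choices, so treat the snippet as an illustration rather than a faithful reproduction of the original algorithm.

```python
import numpy as np

def gis(r, F, b, iters=5000):
    """Generalized-iterative-scaling-style update: F[i, j] >= 0 is feature j
    in state i, with each row of F summing to 1; b[j] is the target expectation.
    Starting from the reference vector r, repeatedly multiply state i by
    prod_j (b_j / E_p[F_j]) ** F[i, j] and renormalize."""
    p = r.copy()
    for _ in range(iters):
        ratios = b / (p @ F)
        p *= np.prod(ratios ** F, axis=1)
        p /= p.sum()
    return p

rng = np.random.default_rng(2)
n = 10
r = rng.dirichlet(np.ones(n))                      # reference distribution
raw = rng.uniform(size=(n, 2)) / 2                 # two nonnegative features, each <= 1/2
F = np.column_stack([raw, 1 - raw.sum(axis=1)])    # slack feature makes rows sum to 1
q = rng.dirichlet(np.ones(n))                      # some feasible distribution ...
b = q @ F                                          # ... supplies consistent targets

p_star = gis(r, F, b)
print("expectation residual:", float(np.abs(p_star @ F - b).max()))
```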
We consider the following question: given information on individual policyholder characteristics, how can we ensure that insurance prices do not discriminate with respect to protected characteristics, such as gender? We address the issues of direct and indirect discrimination, the latter resulting from implicit learning of protected characteristics from nonprotected ones. We provide rigorous mathematical definitions for direct and indirect discrimination, and we introduce a simple formula for discrimination-free pricing, that avoids both direct and indirect discrimination. Our formula works in any statistical model. We demonstrate its application on a health insurance example, using a state-of-the-art generalized linear model and a neural network regression model. An important conclusion is that discrimination-free pricing in general requires collection of policyholders' discriminatory characteristics, posing potential challenges in relation to policyholders' privacy concerns.
A broad set of sufficient conditions that guarantees the existence of the maximum entropy (maxent) distribution consistent with specified bounds on certain generalized moments is derived. Most results in the literature are either focused on the minimum cross-entropy distribution or apply only to distributions with a bounded-volume support or address only equality constraints. The results of this work hold for general moment inequality constraints for probability distributions with possibly unbounded support, and the technical conditions are explicitly on the underlying generalized moment functions. An analytical characterization of the maxent distribution is also derived using results from the theory of constrained optimization in infinite-dimensional normed linear spaces. Several auxiliary results of independent interest pertaining to certain properties of convex coercive functions are also presented.
Karl Pearson's chi-square goodness-of-fit test of 1900 is considered an epochal contribution to the science in general and statistics in particular. Regarded as the first objective criterion for agreement between a theory and reality, and suggested as "beginning the prelude to the modern era in statistics," it stimulated a broadband enquiry into the basics of statistics and led to numerous concepts and ideas which are now common fare in statistical science. Over the decades of the twentieth century the goodness-of-fit has become a substantial field of statistical science of both theoretical and applied importance, and has led to development of a variety of statistical tools. The characterization theorems in probability and statistics, the other topic of our focus, are widely appreciated for their role in clarifying the structure of the families of probability distributions. The purpose of this paper is twofold. The first is to demonstrate that characterization theorems can be natural, logical and effective starting points for constructing goodness-of-fit tests. Towards this end, several entropy and independence characterizations of the normal and the inverse gaussian (IG) distributions, which have resulted in goodness-of-fit tests, are used. The second goal of this paper is to show that the interplay between distributional characterizations and goodness-of-fit assessment continues to be a stimulus for new discoveries and ideas. The point is illustrated using the new concepts of IG symmetry, IG skewness and IG kurtosis, which resulted from goodness-of-fit investigations and have substantially expanded our understanding of the striking and intriguing analogies between the IG and normal families.
Fully Bayesian inference in the presence of unequal probability sampling requires stronger structural assumptions on the data-generating distribution than frequentist semiparametric methods, but offers the potential for improved small-sample inference … Fully Bayesian inference in the presence of unequal probability sampling requires stronger structural assumptions on the data-generating distribution than frequentist semiparametric methods, but offers the potential for improved small-sample inference and convenient evidence synthesis. We demonstrate that the Bayesian exponentially tilted empirical likelihood can be used to combine the practical benefits of Bayesian inference with the robustness and attractive large-sample properties of frequentist approaches. Estimators defined as the solutions to unbiased estimating equations can be used to define a semiparametric model through the set of corresponding moment constraints. We prove Bernstein-von Mises theorems which show that the posterior constructed from the resulting exponentially tilted empirical likelihood becomes approximately normal, centred at the chosen estimator with matching asymptotic variance; thus, the posterior has properties analogous to those of the estimator, such as double robustness, and the frequentist coverage of any credible set will be approximately equal to its credibility. The proposed method can be used to obtain modified versions of existing estimators with improved properties, such as guarantees that the estimator lies within the parameter space. Unlike existing Bayesian proposals, our method does not prescribe a particular choice of prior or require posterior variance correction, and simulations suggest that it provides superior performance in terms of frequentist criteria.
Erwin Schroedinger posed, and to a large extent solved in 1931/32 the problem of finding the most likely random evolution between two continuous probability distributions. This article considers this problem in the case when only samples of the two distributions are available. A novel iterative procedure is proposed, inspired by Fortet-Sinkhorn type algorithms. Since only samples of the marginals are available, the new approach features constrained maximum likelihood estimation in place of the nonlinear boundary couplings, and importance sampling to propagate the functions $\varphi$ and $\hat{\varphi}$ solving the Schroedinger system. This method is well-suited to high-dimensional settings, where introducing grids leads to numerically unfeasible or unreliable methods. The methodology is illustrated in two applications: entropic interpolation of two-dimensional Gaussian mixtures, and the estimation of integrals through a variation of importance sampling.
We present an approach to statistical data modeling and exploratory data analysis called "LP Statistical Data Science." It aims to generalize and unify traditional and novel statistical measures, methods, and exploratory tools. This article outlines fundamental concepts along with real-data examples to illustrate how the "LP Statistical Algorithm" can systematically tackle different varieties of data types, data patterns, and data structures under a coherent theoretical framework. A fundamental role is played by specially designed orthonormal basis of a random variable X for linear (Hilbert space theory) representation of a general function of X, such as $\mbox{E}[Y \mid X]$.
Integral functionals based on convex normal integrands are minimized subject to finitely many moment constraints. The integrands are finite on the positive and infinite on the negative numbers, strictly convex but not necessarily differentiable. The minimization is viewed as a primal problem and studied together with a dual one in the framework of convex duality. The effective domain of the value function is described by a conic core, a modification of the earlier concept of convex core. Minimizers and generalized minimizers are explicitly constructed from solutions of modified dual problems, not assuming the primal constraint qualification. A generalized Pythagorean identity is presented using Bregman distance and a correction term for lack of essential smoothness in integrands. Results are applied to minimization of Bregman distances. Existence of a generalized dual solution is established whenever the dual value is finite, assuming the dual constraint qualification. Examples of "irregular" situations are included, pointing to the limitations of generality of certain key results.
Let $\zeta$ be chosen at random on the surface of the $p$-sphere in $\mathbb{R}^n$, $O_{p,n} := \{x \in \mathbb{R}^n: \sum^n_{i = 1}|x_i|^p = n\}$. If $p = 2$, then the first $k$ components $\zeta_1,\ldots, \zeta_k$ are, for $k$ fixed, in the limit as $n\rightarrow\infty$ independent standard normal. Considering the general case $p > 0$, the same phenomenon appears with a distribution $F_p$ in an exponential class $\mathscr{E}$. $F_p$ can be characterized by the distribution of quotients of sums, by conditional distributions and by a maximum entropy condition. These characterizations have some interesting stability properties. Some discrete versions of this problem and some applications to de Finetti-type theorems are discussed.
In this article, we develop a Bayesian semiparametric analysis of moment condition models by casting the problem within the exponentially tilted empirical likelihood (ETEL) framework. We use this framework to develop a fully Bayesian analysis of correctly and misspecified moment condition models. We show that even under misspecification, the Bayesian ETEL posterior distribution satisfies the Bernstein–von Mises (BvM) theorem. We also develop a unified approach based on marginal likelihoods and Bayes factors for comparing different moment-restricted models and for discarding any misspecified moment restrictions. Computation of the marginal likelihoods is by the method of Chib (1995), as extended to Metropolis–Hastings samplers by Chib and Jeliazkov (2001). We establish the model selection consistency of the marginal likelihood and show that the marginal likelihood favors the model with the minimum number of parameters and the maximum number of valid moment restrictions. When the models are misspecified, the marginal likelihood model selection procedure selects the model that is closer to the (unknown) true data-generating process in terms of the Kullback–Leibler divergence. The ideas and results in this article broaden the theoretical underpinning and value of the Bayesian ETEL framework with many practical applications. The discussion is illuminated through several examples. Supplementary materials for this article are available online.
We establish an equivalence between two conceptually different methods of signal estimation under modeling uncertainty, viz., set-theoretic (ST) estimation and maximum entropy (maxent) MAP estimation. The first method assumes constraints on the signal to be estimated, and the second assumes constraints on a probability distribution for the signal. We provide broad conditions under which these two estimation paradigms produce the same signal estimate. We also show how the maxent formalism can be used to provide solutions to three important problems: how to select sizes of constraint sets in ST estimation (the analysis highlights the role of shrinkage); how to choose the values of parameters in regularized restoration when using multiple regularization functionals; how to trade off model complexity and goodness of fit in a model selection problem.
The purpose of this article is to derive and illustrate a method for fitting models involving both convex and log-convex constraints on the probability vector(s) of a (product) multinomial distribution. We give a two-step algorithm to obtain maximum likelihood estimates of the probability vector(s) and show that it is guaranteed to converge to the true solution. Some examples are discussed which illustrate the procedure.
Prior specification for non-parametric Bayesian inference involves the difficult task of quantifying prior knowledge about a parameter of high, often infinite, dimension. A statistician is unlikely to have informed opinions about all aspects of such a parameter but will have real information about functionals of the parameter, such as the population mean or variance. The paper proposes a new framework for non-parametric Bayes inference in which the prior distribution for a possibly infinite dimensional parameter is decomposed into two parts: an informative prior on a finite set of functionals, and a non-parametric conditional prior for the parameter given the functionals. Such priors can be easily constructed from standard non-parametric prior distributions in common use and inherit the large support of the standard priors on which they are based. Additionally, posterior approximations under these informative priors can generally be made via minor adjustments to existing Markov chain approximation algorithms for standard non-parametric prior distributions. We illustrate the use of such priors in the context of multivariate density estimation using Dirichlet process mixture models, and in the modelling of high dimensional sparse contingency tables.
We prove several fundamental statistical bounds for entropic OT with the squared Euclidean cost between subgaussian probability measures in arbitrary dimension. First, through a new sample complexity result we establish the rate of convergence of entropic OT for empirical measures. Our analysis improves exponentially on the bound of Genevay et al. (2019) and extends their work to unbounded measures. Second, we establish a central limit theorem for entropic OT, based on techniques developed by Del Barrio and Loubes (2019). Previously, such a result was only known for finite metric spaces. As an application of our results, we develop and analyze a new technique for estimating the entropy of a random variable corrupted by gaussian noise.
Let X be a locally bounded semimartingale. Using the theory of BMO-martingales we give a sufficient criterion for a martingale measure for X to minimize relative entropy among all martingale measures. This is applied to prove convergence of the q-optimal martingale measure to the minimal entropy martingale measure in entropy for $q\downarrow 1$ under the assumption that X is continuous and that the density process of some equivalent martingale measure satisfies a reverse $L\log L$-inequality.
When there are a number of stochastic models in the form of probability distributions, one needs to integrate them. Mixtures of distributions are frequently used, but exponential mixtures also provide a good means of integration. This letter proposes a one-parameter family of integration, called alpha-integration, which includes all of these well-known integrations. These are generalizations of various averages of numbers such as arithmetic, geometric, and harmonic averages. There are psychophysical experiments that suggest that alpha-integrations are used in the brain. The alpha-divergence between two distributions is defined, which is a natural generalization of Kullback-Leibler divergence and Hellinger distance, and it is proved that alpha-integration is optimal in the sense of minimizing alpha-divergence. The theory is applied to generalize the mixture of experts and the product of experts to the alpha-mixture of experts. The alpha-predictive distribution is also stated in the Bayesian framework.
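A minimal sketch of alpha-integration for finite distributions, under one common convention for the alpha-representation (f_a(u) = u**((1-a)/2) for a != 1 and log(u) for a = 1); with this convention a = -1, 1, 3 reproduce the arithmetic, normalized geometric and harmonic mixtures, respectively. The normalization step and the example vectors are our own choices.

```python
import numpy as np

def alpha_mean(ps, w, alpha):
    """Alpha-integration of probability vectors ps with weights w:
    apply f_alpha elementwise, average with weights w, invert f_alpha,
    then renormalize to a probability vector."""
    ps, w = np.asarray(ps, float), np.asarray(w, float)
    if alpha == 1:
        m = np.exp(w @ np.log(ps))          # geometric (exponential) mixture
    else:
        e = (1 - alpha) / 2
        m = (w @ ps**e) ** (1 / e)          # includes arithmetic (a=-1), harmonic (a=3)
    return m / m.sum()

p1 = np.array([0.7, 0.2, 0.1])
p2 = np.array([0.1, 0.3, 0.6])
w = np.array([0.5, 0.5])
for a in (-1, 1, 3):
    print(a, np.round(alpha_mean([p1, p2], w, a), 4))
```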
This paper extends some geometric properties of a one-parameter family of relative entropies. These arise as redundancies when cumulants of compressed lengths are considered instead of expected compressed lengths. These parametric relative entropies are a generalization of the Kullback-Leibler divergence. They satisfy the Pythagorean property and behave like squared distances. This property, which was known for finite alphabet spaces, is now extended for general measure spaces. Existence of projections onto convex and certain closed sets is also established. Our results may have applications in the Rényi entropy maximization rule of statistical physics.
We propose non-nested hypotheses tests for conditional moment restriction models based on the method of generalized empirical likelihood (GEL). By utilizing the implied GEL probabilities from a sequence of unconditional moment restrictions that contains equivalent information of the conditional moment restrictions, we construct Kolmogorov-Smirnov and Cramer-von Mises type moment encompassing tests. Advantages of our tests over Otsu and Whang's (2007) tests are: (i) they are free from smoothing parameters, (ii) they can be applied to weakly dependent data, and (iii) they allow non-smooth moment functions. We derive the null distributions, validity of a bootstrap procedure, and local and global power properties of our tests. The simulation results show that our tests have reasonable size and power performance in finite samples.
We study a class of missingness mechanisms, called sequentially additive nonignorable, for modeling multivariate data with item nonresponse. These mechanisms explicitly allow the probability of nonresponse for each variable to depend on the value of that variable, thereby representing nonignorable missingness mechanisms. These missing data models are identified by making use of auxiliary information on marginal distributions, such as marginal probabilities for multivariate categorical variables or moments for numeric variables. We present theory proving identification results, and illustrate the use of these mechanisms in an application.
We introduce generalized notions of a divergence function and a Fisher information matrix. We propose to generalize the notion of an exponential family of models by reformulating it in terms of the Fisher information matrix. Our methods are those of information geometry. The context is general enough to include applications from outside statistics.
Consider the problem of nonparametric estimation of an unknown $\beta$-Hölder smooth density $p_{XY}$ at a given point, where $X$ and $Y$ are both $d$ dimensional. An infinite sequence of i.i.d. samples $(X_i,Y_i)$ are generated according to this distribution, and two terminals observe $(X_i)$ and $(Y_i)$, respectively. They are allowed to exchange $k$ bits either one-way or interactively in order for Bob to estimate the unknown density. We show that the minimax mean square risk is of order $\left(\frac{k}{\log k} \right)^{-\frac{2\beta}{d+2\beta}}$ for one-way protocols and $k^{-\frac{2\beta}{d+2\beta}}$ for interactive protocols. The logarithmic improvement is nonexistent in the parametric counterparts, and therefore can be regarded as a consequence of the nonparametric nature of the problem. Moreover, a few rounds of interactions achieve the interactive minimax rate: the number of rounds can grow as slowly as the super-logarithm (i.e., inverse tetration) of $k$. The proof of the upper bound is based on a novel multi-round scheme for estimating the joint distribution of a pair of biased Bernoulli variables, and the lower bound is built on a sharp estimate of a symmetric strong data processing constant for biased Bernoulli variables.
In probability theory, Bayesian statistics, artificial intelligence and database theory the minimum cross-entropy principle is often used to estimate a distribution with a given set $P$ of marginal distributions under the proportionality assumption with respect to a given "prior" distribution $q$. Such an estimation problem admits a solution if and only if there exists an extension of $P$ that is dominated by $q$. In this paper we consider the case that $q$ is not given explicitly, but is specified as the maximum-entropy extension of an auxiliary set $Q$ of distributions. There are three problems that naturally arise: (1) the existence of an extension of a distribution set (such as $P$ and $Q$), (2) the existence of an extension of $P$ that is dominated by the maximum entropy extension of $Q$, (3) the numeric computation of the minimum cross-entropy extension of $P$ with respect to the maximum entropy extension of $Q$. In the spirit of a divide-and-conquer approach, we prove that, for each of the three above-mentioned problems, the global solution can be easily obtained by combining the solutions to subproblems defined at node level of a suitable tree.
Deming and Stephan (1940) first proposed the use of an iterative proportional fitting procedure to estimate cell probabilities in a contingency table subject to certain marginal constraints. In this paper we first relate this procedure to a variety of sources and a variety of statistical problems. We then describe the procedure geometrically for two-way contingency tables using the concepts presented in Fienberg (1968). This geometrical description leads to a rather simple proof of the convergence of the iterative procedure. We conclude the paper with a discussion of extensions to multi-dimensional tables and to tables with some zero entries.
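The procedure itself is short; here is a hedged sketch of iterative proportional fitting for a two-way table with prescribed row and column totals (the seed table, targets and iteration count are invented for illustration). When the problem is feasible and the seed is positive, the limit is the I-projection of the starting table onto the set of tables with those marginals.

```python
import numpy as np

def ipf(table, row_targets, col_targets, iters=100):
    """Deming-Stephan style iterative proportional fitting for a two-way table:
    alternately rescale rows and columns to match the target marginals."""
    x = table.astype(float).copy()
    for _ in range(iters):
        x *= (row_targets / x.sum(axis=1))[:, None]   # fix row totals
        x *= col_targets / x.sum(axis=0)               # fix column totals
    return x

seed = np.array([[10.0, 5.0, 5.0],
                 [ 2.0, 8.0, 10.0]])
fitted = ipf(seed,
             row_targets=np.array([30.0, 30.0]),
             col_targets=np.array([20.0, 20.0, 20.0]))
print(np.round(fitted, 3))
print(fitted.sum(axis=1), fitted.sum(axis=0))
```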
Tests of marginal homogeneity in a two-way contingency table given by [1], [3], and [13] do not seem to lend themselves easily to extension to the problem of $m$-way marginal homogeneity in an $N$-way $r \times r \times \cdots \times r$ contingency table, $m < N$. The principle of minimum discrimination information estimation and the associated minimum discrimination information statistic applied in [5] to the problem of marginal homogeneity in an $r \times r$ contingency table can be easily extended to the case of a multidimensional contingency table. Estimates of the cell entries under the hypotheses of $m$-way marginal homogeneity are given. Relationships among the tests of homogeneity for $m$-way, $m = 1, 2, \cdots, N - 1$, marginals are given by an analysis of information. Numerical results are given for two sample $3 \times 3 \times 3$ tables, and two $5 \times 5$ tables.
A recent result of Sinkhorn [3] states that for any square matrix $A$ of positive elements, there exist diagonal matrices $D_1$ and $D_2$ with positive diagonal elements for which $D_1 A D_2$ is doubly stochastic. In the present paper, this result is generalized to a wide class of positive operators as follows. Let $(\Omega, \mathfrak{A}, \lambda)$ be the product space of two probability measure spaces $(\Omega_i, \mathfrak{A}_i, \lambda_i)$. Let $f$ denote a measurable function on $(\Omega, \mathfrak{A})$ for which there exist constants $c, C$ such that $0 < c \le f \le C < \infty$. Let $K$ be any nonnegative, two-dimensional real-valued continuous function defined on the open unit square $(0,1) \times (0,1)$, for which the functions $K(u,\cdot)$ and $K(\cdot,v)$ are strictly increasing with strict range $(0,\infty)$ for each $u$ or $v$ in $(0,1)$. Then there exist functions $h: \Omega_1 \to E_1$ and $g: \Omega_2 \to E_1$ such that $\int f(x,v)\, K(h(x), g(v))\, d\lambda_2(v) = 1 = \int f(u,y)\, K(h(u), g(y))\, d\lambda_1(u)$ almost everywhere $(\lambda)$.
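In the classical matrix case, the scaling asserted by Sinkhorn's theorem can be approximated by alternately normalizing rows and columns; the following sketch (our own, with an arbitrary iteration count) recovers the diagonal scaling vectors d1, d2 for a random positive matrix.

```python
import numpy as np

def sinkhorn_scaling(A, iters=500):
    """Alternating row/column normalization for a positive square matrix A:
    returns positive vectors d1, d2 such that diag(d1) @ A @ diag(d2) is
    (approximately) doubly stochastic."""
    n = A.shape[0]
    d1, d2 = np.ones(n), np.ones(n)
    for _ in range(iters):
        d1 = 1.0 / (A @ d2)      # make row sums equal to 1
        d2 = 1.0 / (A.T @ d1)    # make column sums equal to 1
    return d1, d2

rng = np.random.default_rng(3)
A = rng.uniform(0.5, 2.0, size=(4, 4))
d1, d2 = sinkhorn_scaling(A)
S = np.diag(d1) @ A @ np.diag(d2)
print("row sums:", np.round(S.sum(axis=1), 6))
print("col sums:", np.round(S.sum(axis=0), 6))
```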
In its simplest formulation the problem considered is to estimate the cell probabilities $p_{ij}$ of an $r \times c$ contingency table for which the marginal probabilities $p_{i\cdot}$ and $p_{\cdot j}$ are known and fixed, so as to minimize $\sum\sum p_{ij} \ln (p_{ij}/\pi_{ij})$, where $\pi_{ij}$ are the corresponding entries in a given contingency table. An iterative procedure is given for determining the estimates and it is shown that the estimates are BAN, and that the iterative procedure is convergent. A summary of results for a four-way contingency table is given. An illustrative example is given.
The principle of maximum entropy, together with some generalizations, is interpreted as a heuristic principle for the generation of null hypotheses. The main application is to $m$-dimensional population contingency tables, with the marginal totals given down to dimension $m - r$ ("restraints of the $r$th order"). The principle then leads to the null hypothesis of no "$r$th-order interaction." Significance tests are given for testing the hypothesis of no $r$th-order or higher-order interaction within the wider hypothesis of no $s$th-order or higher-order interaction, some cases of which have been treated by Bartlett and by Roy and Kastenbaum. It is shown that, if a complete set of $r$th-order restraints are given, then the hypothesis of the vanishing of all $r$th-order and higher-order interactions leads to a unique set of cell probabilities, if the restraints are consistent, but not only just consistent. This confirms and generalizes a recent conjecture due to Darroch. A kind of duality between maximum entropy and maximum likelihood is proved. Some relationships between maximum entropy, interactions, and Markov chains are proved.
1. Fundamental notions 2. Direct probabilities 3. Estimation problems 4. Approximate methods and simplifications 5. Significance tests: one new parameter 6. Significance tests: various complications 7. Frequency definitions and direct methods 8. General questions