A General Class of Coefficients of Divergence of One Distribution from Another

Type: Article
Publication Date: 1966-01-01
Citations: 1205
DOI: https://doi.org/10.1111/j.2517-6161.1966.tb00626.x

Abstract

Summary Let p1 and p2 be two probability measures on the same space and let φ be the generalized Radon-Nikodym derivative of p2 with respect to p1. If C is a continuous convex function of a real variable such that the p1-expectation (generalized as in Section 3) of C(φ) provides a reasonable coefficient of the p1-dispersion of φ, then this expectation has basic properties which it is natural to demand of a coefficient of divergence of p2 from p1. A general class of coefficients of divergence is generated in this way and it is shown that various available measures of divergence, distance, discriminatory information, etc., are members of this class.
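For finite discrete distributions the construction described above reduces to a weighted sum, and different convex functions C recover familiar divergences. The following Python sketch is illustrative only (the function and generator names are ours, not the paper's); it assumes p1 has full support so that φ = p2/p1 is well defined.

import numpy as np

def ali_silvey_coefficient(p1, p2, C):
    """p1-expectation of C(phi), with phi = dp2/dp1, for finite discrete p1, p2 (p1 > 0)."""
    p1, p2 = np.asarray(p1, dtype=float), np.asarray(p2, dtype=float)
    phi = p2 / p1                        # Radon-Nikodym derivative on a finite space
    return float(np.sum(p1 * C(phi)))    # the generalized p1-expectation reduces to a sum here

# Illustrative convex generators C (each continuous and convex on the positive half-line):
kl   = lambda t: t * np.log(t)           # gives the Kullback-Leibler divergence of p2 from p1
chi2 = lambda t: (t - 1.0) ** 2          # gives the Pearson chi-squared divergence
tv   = lambda t: 0.5 * np.abs(t - 1.0)   # gives the total variation distance

p1, p2 = [0.5, 0.3, 0.2], [0.4, 0.4, 0.2]
for name, C in [("KL", kl), ("chi^2", chi2), ("TV", tv)]:
    print(name, ali_silvey_coefficient(p1, p2, C))

Since each generator shown is continuous and convex, the resulting coefficient is a member of the general class discussed in the paper.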

Locations

  • Journal of the Royal Statistical Society Series B (Statistical Methodology)
Summary The target of this paper is to offer a compact review of the so-called distance methods in Statistics, which cover all the known estimation methods. Based on this fact we propose a new step: to adopt, from Information Theory, the divergence measures as distance methods to compare two distributions, and not only to investigate whether the means or the variances of the distributions are equal. Some useful results towards this line of thought are presented, adopting a compact form for all known divergence measures, and are appropriately analyzed for Biometrical applications, among others.
Comparing probability distributions is an indispensable and ubiquitous task in machine learning and statistics. The most common way to compare a pair of Borel probability measures is to compute a metric between them, and by far the most widely used notions of metric are the Wasserstein metric and the total variation metric. The next most common way is to compute a divergence between them, and in this case almost every known divergence, such as those of Kullback-Leibler, Jensen-Shannon, Rényi, and many more, is a special case of the $f$-divergence. Nevertheless, these metrics and divergences may only be computed, and in fact are only defined, when the pair of probability measures are on spaces of the same dimension. How would one quantify, say, a KL-divergence between the uniform distribution on the interval $[-1,1]$ and a Gaussian distribution on $\mathbb{R}^3$? We show that these common notions of metrics and divergences give rise to natural distances between Borel probability measures defined on spaces of different dimensions, e.g., one on $\mathbb{R}^m$ and another on $\mathbb{R}^n$ where $m$ and $n$ are distinct, so as to give a meaningful answer to the previous question.
An important tool for quantifying the likeness of two probability measures is the class of f-divergences, which have seen widespread application in statistics and information theory. An example is the total variation, which plays an exceptional role among the f-divergences. It is shown that every f-divergence is bounded from below by a function of the total variation; under appropriate regularity conditions, this function is shown to be monotonous. Remark: the proof of the main proposition is relatively easy, whence it is highly likely that the result is known. The author would be very grateful for any information regarding references or related work.
In this paper, we extend and overview wide families of Alpha-, Beta- and Gamma-divergences and discuss their fundamental properties. In the literature usually only one single asymmetric (Alpha, Beta or Gamma) divergence is considered. We show in this paper that there exist families of such divergences with the same consistent properties. Moreover, we establish links and correspondences among these divergences by applying suitable nonlinear transformations. For example, we can generate the Beta-divergences directly from Alpha-divergences and vice versa. Furthermore, we show that a new wide class of Gamma-divergences can be generated not only from the family of Beta-divergences but also from a family of Alpha-divergences. The paper bridges these divergences and shows also their links to Tsallis and Rényi entropies. Most of these divergences have a natural information theoretic interpretation.
This paper introduces scaled Bregman distances of probability distributions which admit nonuniform contributions of observed events. They are introduced in a general form covering not only the distances of discrete and continuous stochastic observations, but also the distances of random processes and signals. It is shown that the scaled Bregman distances extend not only the classical ones studied in the previous literature, but also the information divergence and the related wider class of convex divergences of probability measures. An information-processing theorem is established too, but only in the sense of invariance w.r.t. statistically sufficient transformations and not in the sense of universal monotonicity. Pathological situations where coding can increase the classical Bregman distance are illustrated by a concrete example. In addition to the classical areas of application of the Bregman distances and convex divergences such as recognition, classification, learning, and evaluation of proximity of various features and signals, the paper mentions a new application in 3-D exploratory data analysis. Explicit expressions for the scaled Bregman distances are obtained in general exponential families, with concrete applications in the binomial, Poisson, and Rayleigh families, and in the families of exponential processes such as the Poisson and diffusion processes including the classical examples of the Wiener process and geometric Brownian motion.
F-divergences are a class of functions that quantify the difference between two probability distributions. They are widely used in statistics, machine learning, and information theory. New F-divergences typically refer to variations or extensions of existing F-divergences, introducing additional parameters to customize their behavior or properties. The parametric properties of a new F-divergence measure depend on the specific form of the divergence and the parameters introduced. In this paper, the parametric properties of a new f-divergence functional are discussed. This measure, also known as the Jain and Saraswat measure, was introduced in 2012 and 2013. Applications of this measure in the form of series, using the properties of convexity, are also established.
The $\alpha$-divergences include the well-known Kullback-Leibler divergence, Hellinger distance and $\chi^2$-divergence. In this paper, we derive differential and integral relations between the $\alpha$-divergences that are generalizations of the relation between the Kullback-Leibler divergence and the $\chi^2$-divergence. We also show tight lower bounds for the $\alpha$-divergences under given means and variances. In particular, we show a necessary and sufficient condition such that the binary divergences, which are divergences between probability measures on the same $2$-point set, always attain lower bounds. Kullback-Leibler divergence, Hellinger distance, and $\chi^2$-divergence satisfy this condition.
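As a concrete companion to the abstract above, here is a small Python sketch (our own, using one common parameterization of the α-divergence for finite discrete distributions) in which the Kullback-Leibler divergence, the squared Hellinger distance and the Pearson χ²-divergence appear as special or limiting cases.

import numpy as np

def alpha_divergence(p, q, alpha):
    """One common parameterization of the alpha-divergence for finite discrete p, q > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    if np.isclose(alpha, 1.0):                     # limiting case alpha -> 1: KL(p || q)
        return float(np.sum(p * np.log(p / q)))
    if np.isclose(alpha, 0.0):                     # limiting case alpha -> 0: KL(q || p)
        return float(np.sum(q * np.log(q / p)))
    s = np.sum(p ** alpha * q ** (1.0 - alpha))
    return float((s - 1.0) / (alpha * (alpha - 1.0)))

p, q = [0.5, 0.3, 0.2], [0.4, 0.4, 0.2]
print(alpha_divergence(p, q, 1.0))   # Kullback-Leibler divergence KL(p || q)
print(alpha_divergence(p, q, 0.5))   # 4 x squared Hellinger distance, with H^2 = 1 - sum(sqrt(p*q))
print(alpha_divergence(p, q, 2.0))   # one half of the Pearson chi-squared divergence of p from q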
This paper states that most commonly used minimum divergence estimators are MLEs for suitably generalized bootstrapped sampling schemes. Optimality in the sense of Bahadur for associated tests of fit under such sampling is considered.
Nonparametric two-sample problems are extremely important for applications in different applied disciplines. We define a general MI based on the φ-divergences and use its estimate to propose a new general class of nonparametric two-sample tests for continuous distributions. We derive the asymptotic distribution of the estimates of φ-divergence-based MI (φDMI) under the assumption of independence in the hybrid setup of one binary and one continuous random variable. Additionally, for finite-sample cases, we describe an algorithm for obtaining the bootstrap-based critical value of our proposed two-sample test based on the estimated φDMI. We demonstrate through extensive simulations that the proposed class of tests works exceptionally well in many situations and can detect differences where other two-sample tests fail. Finally, we analyze an application of our proposed tests to assess a solution to information leakage in e-passport data.
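The testing idea can be illustrated schematically: estimate a divergence between the two samples and calibrate it by resampling. The sketch below is not the authors' φDMI procedure (it uses a histogram plug-in KL estimate and a permutation, rather than bootstrap, critical value); it is only meant to show the general shape of a divergence-based two-sample test.

import numpy as np

rng = np.random.default_rng(1)

def histogram_kl(x, y, bins=10):
    """Plug-in KL divergence between smoothed histogram estimates (a simple stand-in statistic)."""
    edges = np.histogram_bin_edges(np.concatenate([x, y]), bins=bins)
    p, _ = np.histogram(x, bins=edges)
    q, _ = np.histogram(y, bins=edges)
    p = (p + 1.0) / (p.sum() + len(p))   # add-one smoothing keeps all ratios finite
    q = (q + 1.0) / (q.sum() + len(q))
    return float(np.sum(p * np.log(p / q)))

def two_sample_pvalue(x, y, n_resamples=500):
    """Permutation p-value for the divergence statistic (schematic, not the paper's algorithm)."""
    observed = histogram_kl(x, y)
    pooled = np.concatenate([x, y])
    exceed = 0
    for _ in range(n_resamples):
        perm = rng.permutation(pooled)
        if histogram_kl(perm[:len(x)], perm[len(x):]) >= observed:
            exceed += 1
    return (exceed + 1) / (n_resamples + 1)

x = rng.normal(0.0, 1.0, 200)   # sample from the first distribution
y = rng.normal(0.5, 1.0, 200)   # sample from a shifted distribution
print(two_sample_pvalue(x, y))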
Mutual information is a measure of the dependence between random variables that has been used successfully in myriad applications in many fields. Generalized mutual information measures that go beyond classical Shannon mutual information have also received much interest in these applications. We derive the mean squared error convergence rates of kernel density-based plug-in estimators of general mutual information measures between two multidimensional random variables $\mathbf{X}$ and $\mathbf{Y}$ for two cases: 1) $\mathbf{X}$ and $\mathbf{Y}$ are continuous; 2) $\mathbf{X}$ and $\mathbf{Y}$ may have any mixture of discrete and continuous components. Using the derived rates, we propose an ensemble estimator of these information measures called GENIE by taking a weighted sum of the plug-in estimators with varied bandwidths. The resulting ensemble estimators achieve the $1/N$ parametric mean squared error convergence rate when the conditional densities of the continuous variables are sufficiently smooth. To the best of our knowledge, this is the first nonparametric mutual information estimator known to achieve the parametric convergence rate for the mixture case, which frequently arises in applications (e.g. variable selection in classification). The estimator is simple to implement and it uses the solution to an offline convex optimization problem and simple plug-in estimators. A central limit theorem is also derived for the ensemble estimators and minimax rates are derived for the continuous case. We demonstrate the ensemble estimator for the mixed case on simulated data and apply the proposed estimator to analyze gene relationships in single cell data.
The Kullback-Leibler (KL) divergence is a discrepancy measure between probability distributions that plays a central role in information theory, statistics and machine learning. While there are numerous methods for estimating this quantity from data, a limit distribution theory which quantifies fluctuations of the estimation error is largely obscure. In this paper, we close this gap by identifying sufficient conditions on the population distributions for the existence of distributional limits and characterizing the limiting variables. These results are used to derive one- and two-sample limit theorems for Gaussian-smoothed KL divergence, both under the null and the alternative. Finally, an application of the limit distribution result to auditing differential privacy is proposed and analyzed for significance level and power against local alternatives.
This paper presents a numerical analysis for the time-implicit numerical approximation of the Boltzmann equation based on a moment system approximation in velocity dependence and a discontinuous Galerkin finite-element (DGFE) approximation in time and position dependence. The implicit nature of the DGFE moment method in position and time dependence provides a robust numerical algorithm for the approximation of solutions of the Boltzmann equation. The closure relation for the moment systems derives from minimization of a suitable φ-divergence. We present sufficient conditions such that this divergence-based closure yields a hierarchy of tractable symmetric hyperbolic moment systems that retain the fundamental structural properties of the Boltzmann equation. The resulting combined space-time DGFE moment method corresponds to a Galerkin approximation of the Boltzmann equation in renormalized form. We propose a renormalization map that facilitates the approximation of multidimensional problems in an implicit manner. Moreover, upper and lower entropy bounds are derived for the proposed DGFE moment scheme. Numerical results for benchmark problems governed by the BGK-Boltzmann equation are presented to illustrate the approximation properties of the new DGFE moment method, and it is shown that the proposed velocity-space-time DGFE moment method is entropy bounded.
The goal of Optimal Transport (OT) is to define geometric tools that are useful to compare probability distributions. Their use dates back to 1781. Recent years have witnessed a new revolution in the spread of OT, thanks to the emergence of approximate solvers that can scale to sizes and dimensions that are relevant to data sciences. Thanks to this newfound scalability, OT is being increasingly used to unlock various problems in imaging sciences (such as color or texture processing), computer vision and graphics (for shape manipulation) or machine learning (for regression, classification and density fitting). This monograph reviews OT with a bias toward numerical methods and their applications in data sciences, and sheds light on the theoretical properties of OT that make it particularly useful for some of these applications. Computational Optimal Transport presents an overview of the main theoretical insights that support the practical effectiveness of OT before explaining how to turn these insights into fast computational schemes. Written for readers at all levels, the authors provide descriptions of foundational theory at two levels. Generally accessible to all readers, more advanced readers can read the specially identified, more general mathematical expositions of optimal transport tailored for discrete measures. Furthermore, several chapters deal with the interplay between continuous and discrete measures, and are thus targeting a more mathematically-inclined audience. This monograph will be a valuable reference for researchers and students wishing to get a thorough understanding of Computational Optimal Transport, a mathematical gem at the interface of probability, analysis and optimization.
An attempt is made to determine the logically consistent rules for selecting a vector from any feasible set defined by linear constraints, when either all $n$-vectors or those with positive components or the probability vectors are permissible. Some basic postulates are satisfied if and only if the selection rule is to minimize a certain function which, if a "prior guess" is available, is a measure of distance from the prior guess. Two further natural postulates restrict the permissible distances to the author's $f$-divergences and Bregman's divergences, respectively. As corollaries, axiomatic characterizations of the methods of least squares and minimum discrimination information are arrived at. Alternatively, the latter are also characterized by a postulate of composition consistency. As a special case, a derivation of the method of maximum entropy from a small set of natural axioms is obtained.
The paper deals with the f-divergences of Csiszár generalizing the discrimination information of Kullback, the total variation distance, the Hellinger divergence, and the Pearson divergence. All basic properties of f-divergences including relations to the decision errors are proved in a new manner, replacing the classical Jensen inequality by a new generalized Taylor expansion of convex functions. Some new properties are proved too, e.g., relations to the statistical sufficiency and deficiency. The generalized Taylor expansion also shows very easily that all f-divergences are average statistical informations (differences between prior and posterior Bayes errors) mutually differing only in the weights imposed on various prior distributions. The statistical information introduced by De Groot and the classical information of Shannon are shown to be extremal cases corresponding to alpha=0 and alpha=1 in the class of the so-called Arimoto alpha-informations introduced in this paper for 0<alpha<1 by means of the Arimoto alpha-entropies. Some new examples of f-divergences are introduced as well, namely, the Shannon divergences and the Arimoto alpha-divergences leading for alpha↑1 to the Shannon divergences. Square roots of all these divergences are shown to be metrics satisfying the triangle inequality. The last section introduces statistical tests and estimators based on the minimal f-divergence with the empirical distribution achieved in the families of hypothetic distributions. For the Kullback divergence this leads to the classical likelihood ratio test and estimator.
An alternate formulation of the robust hypothesis testing problem is considered in which robustness is defined in terms of a maximin game with a statistical distance criterion as a payoff function. This distance criterion, which is a generalized version of signal-to-noise ratio, offers advantages over traditional error probability or risk criteria in this problem because of the greater tractability of the distance measure. Within this framework, a design procedure is developed which applies to a more general class of problems than do earlier robustness results based on risks. Furthermore, it is shown for the general case that when a decision rule exists that is robust in terms of risk, the same decision rule will be robust in terms of distance, a fact which supports the use of the latter criterion.
This paper extends the empirical minimum divergence approach for models which satisfy linear constraints with respect to the probability measure of the underlying variable (moment constraints) to the case where such constraints pertain to its quantile measure (called here semiparametric quantile models). The case when these constraints describe shape conditions as handled by the L-moments is considered, and both the description of these models as well as the resulting nonclassical minimum divergence procedures are presented. These models describe neighbourhoods of classical models used mainly for their tail behavior, for example, neighborhoods of Pareto or Weibull distributions, with which they may share the same first L-moments. The properties of the resulting estimators are illustrated by simulated examples comparing maximum likelihood estimators on Pareto and Weibull models to the minimum chi-square empirical divergence approach on semiparametric quantile models, and others.
A divergence measure between discrete probability distributions introduced by Csiszár (1967) generalizes the Kullback-Leibler information and several other information measures considered in the literature. We introduce a weighted divergence which generalizes the weighted Kullback-Leibler information considered by Taneja (1985). The weighted divergence between an empirical distribution and a fixed distribution and the weighted divergence between two independent empirical distributions are here investigated for large simple random samples, and the asymptotic distributions are shown to be either normal or equal to the distribution of a linear combination of independent χ²-variables.
New information inequalities involving f-divergences have been established using convexity arguments and some well-known inequalities such as the Jensen inequality and the Arithmetic-Geometric Mean (AGM) inequality. Some particular cases have also been discussed.
In the context of supervised learning, meta learning uses features, metadata and other information to learn about the difficulty, behavior, or composition of the problem. Using this knowledge can be useful to contextualize classifier results or allow for targeted decisions about future data sampling. In this paper, we are specifically interested in learning the Bayes error rate (BER) based on a labeled data sample. Providing a tight bound on the BER that is also feasible to estimate has been a challenge. Previous work [1] has shown that a pairwise bound based on the sum of Henze-Penrose (HP) divergence over label pairs can be directly estimated using a sum of Friedman-Rafsky (FR) multivariate run test statistics. However, in situations in which the dataset and number of classes are large, this bound is computationally infeasible to calculate and may not be tight. Other multi-class bounds also suffer from computationally complex estimation procedures. In this paper, we present a generalized HP divergence measure that allows us to estimate the Bayes error rate with log-linear computation. We prove that the proposed bound is tighter than both the pairwise method and a bound proposed by Lin [2]. We also empirically show that these bounds are close to the BER. We illustrate the proposed method on the MNIST dataset, and show its utility for the evaluation of feature reduction strategies. We further demonstrate an approach for evaluation of deep learning architectures using the proposed bounds.
Integral functionals based on convex normal integrands are minimized subject to finitely many moment constraints. The integrands are finite on the positive and infinite on the negative numbers, strictly convex but not necessarily differentiable. The minimization is viewed as a primal problem and studied together with a dual one in the framework of convex duality. The effective domain of the value function is described by a conic core, a modification of the earlier concept of convex core. Minimizers and generalized minimizers are explicitly constructed from solutions of modified dual problems, not assuming the primal constraint qualification. A generalized Pythagorean identity is presented using Bregman distance and a correction term for lack of essential smoothness in integrands. Results are applied to minimization of Bregman distances. Existence of a generalized dual solution is established whenever the dual value is finite, assuming the dual constraint qualification. Examples of 'irregular' situations are included, pointing to the limitations of generality of certain key results.
We propose a framework to analyze and quantify the bias in adaptive data analysis. It generalizes that proposed by Russo and Zou '15, applying to measurements whose moment generating function exists, measurements with a finite p-norm, and measurements in general Orlicz spaces. We introduce a new class of dependence measures which retain key properties of mutual information while more effectively quantifying the exploration bias for heavy-tailed distributions. We provide examples of cases where our bounds are nearly tight in situations where the original framework of Russo and Zou '15 does not apply.
We prove new entropy inequalities for log concave and s-concave functions that strengthen and generalize recently established reverse log Sobolev and Poincaré inequalities for such functions. This leads naturally to the concept of f-divergence and, in particular, relative entropy for s-concave and log concave functions. We establish their basic properties, among them the affine invariant valuation property. Applications are given in the theory of convex bodies.
In this article, we assume that categorical data are distributed according to a multinomial distribution whose probabilities follow a loglinear model. The inference problem we consider is that of hypothesis testing in a loglinear-model setting. The null hypothesis is a composite hypothesis nested within the alternative. Test statistics are chosen from the general class of phi-divergence statistics. This article collects together the operating characteristics of the hypothesis test based on both asymptotic (using large-sample theory) and finite-sample (using a designed simulation study) results. Members of the class of power divergence statistics are compared, and it is found that the Cressie-Read statistic offers an attractive alternative to the Pearson-based and the likelihood ratio-based test statistics, in terms of both exact and asymptotic size and power.
We generalise the classical Pinsker inequality, which relates variational divergence to Kullback-Leibler divergence, in two ways: we consider arbitrary f-divergences in place of KL divergence, and we assume knowledge of a sequence of values of generalised variational divergences. We then develop a best possible inequality for this doubly generalised situation. Specialising our result to the classical case provides a new and tight explicit bound relating KL to variational divergence (solving a problem posed by Vajda some 40 years ago). The solution relies on exploiting a connection between divergences and the Bayes risk of a learning problem via an integral representation.
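The classical inequality being generalized here can be spot-checked numerically. The short Python sketch below (ours, using natural logarithms and total variation defined as half the L1 distance) verifies Pinsker's inequality KL(p||q) >= 2*TV(p,q)^2 on random discrete distributions.

import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))          # natural-log KL divergence

def total_variation(p, q):
    return 0.5 * float(np.sum(np.abs(p - q)))        # half the L1 distance, so TV lies in [0, 1]

# Spot-check the classical Pinsker inequality KL(p || q) >= 2 * TV(p, q)^2
for _ in range(5):
    p = rng.dirichlet(np.ones(4))
    q = rng.dirichlet(np.ones(4))
    assert kl(p, q) >= 2.0 * total_variation(p, q) ** 2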
The maximum entropy approach to classification is very well studied in applied statistics and machine learning, and almost all the methods that exist in the literature are discriminative in nature. In this paper, we introduce a generative maximum entropy classification method with feature selection for large dimensional data such as text datasets. To tackle the curse of dimensionality of large data sets, we employ a conditional independence assumption (Naive Bayes) and we perform feature selection simultaneously, by enforcing a 'maximum discrimination' between estimated class conditional densities. For two-class problems, in the proposed method, we use Jeffreys (J) divergence to discriminate the class conditional densities. To extend our method to the multi-class case, we propose a completely new approach by considering a multi-distribution divergence: we replace Jeffreys divergence by Jensen-Shannon (JS) divergence to discriminate conditional densities of multiple classes. In order to reduce computational complexity, we employ a modified Jensen-Shannon divergence (JS_GM), based on the AM-GM inequality. We show that the resulting divergence is a natural generalization of Jeffreys divergence to the multiple-distribution case. As far as the theoretical justifications are concerned, we show that when one intends to select the best features in a generative maximum entropy approach, maximum discrimination using J-divergence emerges naturally in binary classification. The performance and a comparative study of the proposed algorithms are demonstrated on large dimensional text and gene expression datasets, showing that our methods scale up very well with large dimensional datasets.
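For reference, the two standard divergences named in this abstract can be written down directly for discrete distributions. The sketch below is ours; it implements the usual Jeffreys divergence and the multi-distribution Jensen-Shannon divergence with uniform weights, not the authors' JS_GM modification.

import numpy as np

def kl(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                                     # convention: 0 * log(0) contributes 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jeffreys(p, q):
    """Jeffreys divergence: the symmetrized KL divergence J(p, q) = KL(p||q) + KL(q||p)."""
    return kl(p, q) + kl(q, p)

def jensen_shannon(dists):
    """Jensen-Shannon divergence of several distributions with uniform weights."""
    dists = [np.asarray(d, dtype=float) for d in dists]
    m = np.mean(dists, axis=0)                       # the uniform mixture of all the distributions
    return float(np.mean([kl(d, m) for d in dists]))

p, q, r = [0.7, 0.2, 0.1], [0.4, 0.4, 0.2], [0.2, 0.3, 0.5]
print(jeffreys(p, q))
print(jensen_shannon([p, q, r]))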
We develop a rigorous and general framework for constructing information-theoretic divergences that subsume both $f$-divergences and integral probability metrics (IPMs), such as the $1$-Wasserstein distance. We prove under which assumptions these divergences, hereafter referred to as $(f,\Gamma)$-divergences, provide a notion of 'distance' between probability measures and show that they can be expressed as a two-stage mass-redistribution/mass-transport process. The $(f,\Gamma)$-divergences inherit features from IPMs, such as the ability to compare distributions which are not absolutely continuous, as well as from $f$-divergences, namely the strict concavity of their variational representations and the ability to control heavy-tailed distributions for particular choices of $f$. When combined, these features establish a divergence with improved properties for estimation, statistical learning, and uncertainty quantification applications. Using statistical learning as an example, we demonstrate their advantage in training generative adversarial networks (GANs) for heavy-tailed, not-absolutely continuous sample distributions. We also show improved performance and stability over gradient-penalized Wasserstein GAN in image generation.
Just as the classical f-divergences parameterized on convex functions generalize the classical relative entropy, the quantum quasi-entropies generalize the quantum relative entropy parameterized on operator convex functions. Quantum quasi-entropies satisfy a version of the monotonicity property and are jointly convex in their arguments. We provide the equality conditions for these inequalities for a family of operator convex functions that includes the von Neumann entropies as a special case. Quantum quasi-entropies are defined with an arbitrarily chosen matrix and, since the inequalities are true for any choice of such matrices, we show that these inequalities can be interpreted as operator inequalities.
We study the detection of a sparse change in a high-dimensional mean vector as a minimax testing problem. Our first main contribution is to derive the exact minimax testing rate across all parameter regimes for $n$ independent, $p$-variate Gaussian observations. This rate exhibits a phase transition when the sparsity level is of order $\sqrt{p \log \log (8n)}$ and has a very delicate dependence on the sample size: in a certain sparsity regime it involves a triple iterated logarithmic factor in $n$. Further, in a dense asymptotic regime, we identify the sharp leading constant, while in the corresponding sparse asymptotic regime, this constant is determined to within a factor of $\sqrt{2}$. Extensions that cover spatial and temporal dependence, primarily in the dense case, are also provided.
In this work we explore the class of heavy-tailed distributions and discuss their significance in reliability engineering. At the same time we discuss measures of divergence which are extensively used in statistics in various fields. In this paper we rely on such measures to evaluate the residual and past lifetimes of events which are associated with the tail part of the distribution. More specifically, we propose a class of goodness-of-fit tests based on Csiszár's class of measures designed for heavy-tailed distributions.
In this article, the defining properties of any valid measure of the dependence between two continuous random variables are revisited and complemented with two original ones, shown to imply other usual postulates. While other popular choices are proved to violate some of these requirements, a class of dependence measures satisfying all of them is identified. One particular measure, which we call the Hellinger correlation, appears as a natural choice within that class due to both its theoretical and intuitive appeal. A simple and efficient nonparametric estimator for that quantity is proposed, with its implementation publicly available in the R package HellCor. Synthetic and real-data examples illustrate the descriptive ability of the measure, which can also be used as a test statistic for exact independence testing. Supplementary materials for this article are available online.
We propose a framework for general Bayesian inference. We argue that a valid update of a prior belief distribution to a posterior can be made for parameters which are connected to observations through a loss function rather than the traditional likelihood function, which is recovered under the special case of using self-information loss. Modern application areas make it increasingly challenging for Bayesians to attempt to model the true data generating mechanism. Moreover, when the object of interest is low dimensional, such as a mean or median, it is cumbersome to have to achieve this via a complete model for the whole data distribution. More importantly, there are settings where the parameter of interest does not directly index a family of density functions and thus the Bayesian approach to learning about such parameters is currently regarded as problematic. Our proposed framework uses loss functions to connect information in the data to functionals of interest. The updating of beliefs then follows from a decision theoretic approach involving cumulative loss functions. Importantly, the procedure coincides with Bayesian updating when a true likelihood is known, yet provides coherent subjective inference in much more general settings. Connections to other inference frameworks are highlighted.
A unified family of goodness-of-fit tests based on φ-divergences is introduced and studied. The new family of test statistics Sn(s) includes both the supremum version of the Anderson–Darling statistic and the test statistic of Berk and Jones [Z. Wahrsch. Verw. Gebiete 47 (1979) 47–59] as special cases (s=2 and s=1, resp.). We also introduce integral versions of the new statistics. We show that the asymptotic null distribution theory of Berk and Jones [Z. Wahrsch. Verw. Gebiete 47 (1979) 47–59] and Wellner and Koltchinskii [High Dimensional Probability III (2003) 321–332. Birkhäuser, Basel] for the Berk–Jones statistic applies to the whole family of statistics Sn(s) with s∈[−1, 2]. On the side of power behavior, we study the test statistics under fixed alternatives and give extensions of the “Poisson boundary” phenomena noted by Berk and Jones for their statistic. We also extend the results of Donoho and Jin [Ann. Statist. 32 (2004) 962–994] by showing that all our new tests for s∈[−1, 2] have the same “optimal detection boundary” for normal shift mixture alternatives as Tukey’s “higher-criticism” statistic and the Berk–Jones statistic.
We propose a structure of a semiparametric two-component mixture model when one component is parametric and the other is defined through linear constraints on its distribution function. Estimation of a two-component mixture model with an unknown component is very difficult when no particular assumption is made on the structure of the unknown component. A symmetry assumption was used in the literature to simplify the estimation. Such a method has the advantage of producing consistent and asymptotically normal estimators, and identifiability of the semiparametric mixture model becomes tractable. Still, existing methods which estimate a semiparametric mixture model have their limits when the parametric component has unknown parameters or the proportion of the parametric part is either very high or very low. We propose in this paper a method to incorporate prior linear information about the distribution of the unknown component in order to better estimate the model when existing estimation methods fail. The new method is based on $φ$-divergences and has an original form since the minimization is carried over both arguments of the divergence. The resulting estimators are proved to be consistent and asymptotically normal under standard assumptions. We show that using Pearson's $χ^2$ divergence our algorithm has a linear complexity when the constraints are moment-type. Simulations on univariate and multivariate mixtures demonstrate the viability and the interest of our novel approach.
In this paper we present a simulation study to analyze the behavior of the $\phi$-divergence test statistics in the problem of goodness-of-fit for loglinear models with linear constraints and multinomial sampling. We pay special attention to the Rényi and $I_{r}$-divergence measures.
The asymptotic cumulants of the minimum phi-divergence estimators of the parameters in a model for categorical data are obtained up to the fourth order, with the higher-order asymptotic variance under possible model misspecification. The corresponding asymptotic cumulants up to the third order for the studentized minimum phi-divergence estimator are also derived. These asymptotic cumulants, when a model is misspecified, depend on the form of the phi-divergence. Numerical illustrations with simulations are given for typical cases of the phi-divergence, where the maximum likelihood estimator does not necessarily give best results. Real data examples are shown using log-linear models for contingency tables.
Mixed $f$-divergences, a concept from information theory and statistics, measure the difference between multiple pairs of distributions. We introduce them for log concave functions and establish some of their properties. Among them are affine invariant vector entropy inequalities, like new Alexandrov-Fenchel type inequalities and an affine isoperimetric inequality for the vector form of the Kullback-Leibler divergence for log concave functions. Special cases of $f$-divergences are mixed $L_\lambda$-affine surface areas for log concave functions. For those, we establish various affine isoperimetric inequalities as well as a vector Blaschke-Santaló type inequality.
The performance in terms of minimal Bayes' error probability for detection of a high-dimensional random tensor is a fundamental, under-studied, difficult problem. In this work, we consider two Signal to Noise Ratio (SNR)-based detection problems of interest. Under the alternative hypothesis, i.e., for a non-zero SNR, the observed signals are either a noisy rank-R tensor admitting a Q-order Canonical Polyadic Decomposition (CPD) with large factors of size $N_q \times R$, $1 \le q \le Q$, where $R, N_q \to \infty$ with $R^{1/q}/N_q$ converging towards a finite constant, or a noisy tensor admitting a Tucker Decomposition (TKD) of multilinear $(M_1, \ldots, M_Q)$-rank with large factors of size $N_q \times M_q$, $1 \le q \le Q$, where $N_q, M_q \to \infty$ with $M_q/N_q$ converging towards a finite constant. The detection of the random entries (coefficients) of the core tensor in the CPD/TKD is hard to study since the exact derivation of the error probability is mathematically intractable. To circumvent this technical difficulty, the Chernoff Upper Bound (CUB) for larger SNR and the Fisher information at low SNR are derived and studied, based on information geometry theory. The tightest CUB is reached for the value minimizing the error exponent, denoted by $s^\star$. In general, due to the asymmetry of the s-divergence, the Bhattacharyya Upper Bound (BUB) (that is, the Chernoff information calculated at $s^\star = 1/2$) cannot solve this problem effectively. As a consequence, we rely on a costly numerical optimization strategy to find $s^\star$. However, thanks to powerful random matrix theory tools, a simple analytical expression of $s^\star$ is provided with respect to the SNR in the two schemes considered. A main conclusion of this work is that the BUB is the tightest bound at low SNRs. This property is, however, no longer true for higher SNRs.
We unify f-divergences, Bregman divergences, surrogate loss bounds (regret bounds), proper scoring rules, matching losses, cost curves, ROC curves and information. We do this by systematically studying integral and variational representations of these objects and in so doing identify their primitives, which are all related to cost-sensitive binary classification. As well as clarifying relationships between generative and discriminative views of learning, the new machinery leads to tight and more general surrogate loss bounds and generalised Pinsker inequalities relating f-divergences to variational divergence. The new viewpoint illuminates existing algorithms: it provides a new derivation of Support Vector Machines in terms of divergences and relates Maximum Mean Discrepancy to Fisher Linear Discriminants. It also suggests new techniques for estimating f-divergences.
In many cases an optimum or computationally convenient test of a simple hypothesis $H_0$ against a simple alternative $H_1$ may be given in the following form. Reject $H_0$ if $S_n = \sum^n_{j=1} X_j \leqq k$, where $X_1, X_2, \cdots, X_n$ are $n$ independent observations of a chance variable $X$ whose distribution depends on the true hypothesis and where $k$ is some appropriate number. In particular the likelihood ratio test for fixed sample size can be reduced to this form. It is shown that with each test of the above form there is associated an index $\rho$. If $\rho_1$ and $\rho_2$ are the indices corresponding to two alternative tests, $e = \log \rho_1/\log \rho_2$ measures the relative efficiency of these tests in the following sense. For large samples, a sample of size $n$ with the first test will give about the same probabilities of error as a sample of size $en$ with the second test. To obtain the above result, use is made of the fact that $P(S_n \leqq na)$ behaves roughly like $m^n$ where $m$ is the minimum value assumed by the moment generating function of $X - a$. It is shown that if $H_0$ and $H_1$ specify probability distributions of $X$ which are very close to each other, one may approximate $\rho$ by assuming that $X$ is normally distributed.
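The index ρ described here is what is now usually called the Chernoff coefficient: for two discrete distributions P0 and P1 it equals the minimum over s in [0, 1] of the sum over x of P0(x)^s P1(x)^(1-s), and the error probability of the best test decays roughly like ρ^n. The Python sketch below (ours) evaluates it on a grid for a pair of Bernoulli distributions.

import numpy as np

def chernoff_coefficient(p, q, grid_size=1001):
    """rho = min over s in [0, 1] of sum_x p(x)^s * q(x)^(1 - s)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    s_values = np.linspace(0.0, 1.0, grid_size)
    vals = np.array([np.sum(p ** s * q ** (1.0 - s)) for s in s_values])
    return float(vals.min())

# Two Bernoulli distributions written as probability vectors over {0, 1}
p0, p1 = np.array([0.7, 0.3]), np.array([0.4, 0.6])
rho = chernoff_coefficient(p0, p1)
print(rho, -np.log(rho))   # the index rho and the Chernoff information -log(rho)

With two candidate tests indexed by rho_1 and rho_2, the relative efficiency in the abstract's sense would be log(rho_1)/log(rho_2).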
This chapter presents the basic concepts and results of the theory of testing statistical hypotheses. The generalized likelihood ratio tests that are discussed can be applied to testing in the presence of nuisance parameters. Besides the likelihood ratio tests, for testing in the presence of nuisance parameters one can use conditional tests. The chapter also presents the motivation for steps of the proof of the randomization principle theorem. It considers the case of a single observation, but the extension to the case of n observations will be obvious. The chapter presents an approach that requires unbiasedness and explains how the theory of testing statistical hypotheses is related to the theory of confidence intervals. It reviews the major testing procedures for parameters of normal distributions and is intended as a convenient reference for users rather than an exposition of new concepts or results.
Summary In a previous paper by one of the authors (Silvey, 1964) it was suggested that the Radon–Nikodym derivative of the joint distribution of two, not necessarily real-valued, random variables with respect to the product of their marginal distributions provides the analytic key for discussing association between the random variables. In this paper we shall develop this point and lend some support to the suggestion by proving that in the case of normal random vectors the distribution of this derivative, as determined by the product of the marginal distributions, becomes more widely spread out as association between the vectors increases. More precisely we shall show that the expected value of any continuous convex function of the derivative is a non-decreasing function of each of the canonical correlation coefficients between the two vectors.
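For the special convex function C(t) = t log t, the expectation in question is the mutual information, and for a bivariate normal pair with correlation r it has the closed form -0.5 * log(1 - r^2), which is indeed non-decreasing in |r|. A short Python illustration (ours) of that special case:

import numpy as np

# Mutual information of a bivariate normal pair with correlation r, i.e. the expectation
# of C(phi) with C(t) = t*log(t) and phi the joint-to-product-of-marginals density ratio.
for r in [0.0, 0.3, 0.6, 0.9]:
    print(r, -0.5 * np.log(1.0 - r ** 2))   # non-decreasing in |r|, as the paper's result predicts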