Rényi Divergence and Kullback-Leibler Divergence

Type: Article
Publication Date: 2014-06-19
Citations: 1254
DOI: https://doi.org/10.1109/tit.2014.2320500

Abstract

Rényi divergence is related to Rényi entropy much like Kullback-Leibler divergence is related to Shannon's entropy, and comes up in many settings. It was introduced by Rényi as a measure of information that satisfies almost the same axioms as Kullback-Leibler divergence, and depends on a parameter that is called its order. In particular, the Rényi divergence of order 1 equals the Kullback-Leibler divergence. We review and extend the most important properties of Rényi divergence and Kullback-Leibler divergence, including convexity, continuity, limits of σ-algebras, and the relation of the special order 0 to the Gaussian dichotomy and contiguity. We also show how to generalize the Pythagorean inequality to orders different from 1, and we extend the known equivalence between channel capacity and minimax redundancy to continuous channel inputs (for all orders) and present several other minimax results.
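
For readers who want to experiment with the order parameter, the defining formula for discrete distributions is D_α(P||Q) = (1/(α−1)) log Σ_i p_i^α q_i^(1−α) for α ≠ 0, 1, with orders 0 and 1 obtained as limits (order 1 being the Kullback-Leibler divergence). The following minimal Python sketch is illustrative only, not code from the paper; it evaluates a few orders and shows the continuity in α near 1.

```python
import numpy as np

def renyi_divergence(p, q, alpha):
    """Rényi divergence D_alpha(P||Q) for discrete distributions (natural log).

    The special orders alpha = 0 and alpha = 1 are handled by their limits:
    D_0(P||Q) = -log Q(p > 0) and D_1(P||Q) = KL(P||Q).
    """
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    if alpha == 0:
        return -np.log(q[mask].sum())
    if alpha == 1:  # Kullback-Leibler divergence
        return np.sum(p[mask] * np.log(p[mask] / q[mask]))
    return np.log(np.sum(p[mask] ** alpha * q[mask] ** (1 - alpha))) / (alpha - 1)

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
for a in [0.0, 0.5, 0.999, 1.0, 2.0]:
    print(f"alpha={a:5.3f}  D_alpha(P||Q) = {renyi_divergence(p, q, a):.6f}")
# D_0.999 is close to the order-1 (KL) value, illustrating continuity in the order.
```
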

Locations

  • arXiv (Cornell University)
  • HAL (Le Centre pour la Communication Scientifique Directe)
  • DataCite API
  • IEEE Transactions on Information Theory


Having a distance measure between quantum states satisfying the right properties is of fundamental importance in all areas of quantum information. In this work, we present a systematic study of the geometric Rényi divergence (GRD), also known as the maximal Rényi divergence, from the point of view of quantum information theory. We show that this divergence, together with its extension to channels, has many appealing structural properties, which are not satisfied by other quantum Rényi divergences. For example we prove a chain rule inequality that immediately implies the “amortization collapse” for the geometric Rényi divergence, addressing an open question by Berta et al. [Letters in Mathematical Physics 110:2277–2336, 2020, Equation (55)] in the area of quantum channel discrimination. As applications, we explore various channel capacity problems and construct new channel information measures based on the geometric Rényi divergence, sharpening the previously best-known bounds based on the max-relative entropy while still keeping the new bounds single-letter and efficiently computable. A plethora of examples are investigated and the improvements are evident for almost all cases.
We present a systematic study of the geometric Rényi divergence (GRD), also known as the maximal Rényi divergence, from the point of view of quantum information theory. We show that this divergence, together with its extension to channels, has many appealing structural properties. For example we prove a chain rule inequality that immediately implies the amortization collapse for the geometric Rényi divergence, addressing an open question by Berta et al. [arXiv:1808.01498, Equation (55)] in the area of quantum channel discrimination. As applications, we explore various channel capacity problems and construct new channel information measures based on the geometric Rényi divergence, sharpening the previously best-known bounds based on the max-relative entropy while still keeping the new bounds single-letter and efficiently computable. A plethora of examples are investigated and the improvements are evident for almost all cases.

f-Divergences

2024-12-31
This enthusiastic introduction to the fundamentals of information theory builds from classical Shannon theory through to modern applications in statistical learning, equipping students with a uniquely well-rounded and rigorous foundation for further study. Introduces core topics such as data compression, channel coding, and rate-distortion theory using a unique finite block-length approach. With over 210 end-of-part exercises and numerous examples, students are introduced to contemporary applications in statistics, machine learning and modern communication theory. This textbook presents information-theoretic methods with applications in statistical learning and computer science, such as f-divergences, PAC Bayes and variational principle, Kolmogorov's metric entropy, strong data processing inequalities, and entropic upper bounds for statistical estimation. Accompanied by a solutions manual for instructors, and additional standalone chapters on more specialized topics in information theory, this is the ideal introductory textbook for senior undergraduate and graduate students in electrical engineering, statistics, and computer science.
The Kullback-Leibler divergence rate of two finite or countable ergodic Markov chains is well-known to exist and have an explicit expression as a function of the transition matrices of the chains, allowing access to classical tools for applications, such as minimization under constraints or projections on convex sets. The existence of Rényi divergence rates of ergodic Markov chains has been established in [5], [12]; here we establish explicit expressions for them and prove some properties of the resulting measures of discrepancy between stochastic matrices, opening the way to applications.
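
For context, the classical explicit expression for the Kullback-Leibler divergence rate mentioned above is Σ_i π_i Σ_j P_ij log(P_ij / Q_ij), where π is the stationary distribution of the chain with transition matrix P. The sketch below evaluates that formula in Python; it is an illustration of the KL rate only, and does not reproduce the paper's Rényi-rate expressions.

```python
import numpy as np

def stationary_distribution(P):
    """Stationary distribution of an ergodic transition matrix P (rows sum to 1)."""
    evals, evecs = np.linalg.eig(P.T)
    pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    return pi / pi.sum()

def kl_divergence_rate(P, Q):
    """KL divergence rate between two ergodic Markov chains with transition
    matrices P and Q: sum_i pi_i sum_j P_ij log(P_ij / Q_ij)."""
    pi = stationary_distribution(P)
    mask = P > 0
    terms = np.where(mask, P * np.log(np.where(mask, P, 1.0) / Q), 0.0)
    return np.sum(pi[:, None] * terms)

P = np.array([[0.9, 0.1], [0.2, 0.8]])
Q = np.array([[0.7, 0.3], [0.4, 0.6]])
print(kl_divergence_rate(P, Q))
```
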
I provide some technical notes regarding the Kullback-Leibler divergence. Derivations of the Kullback-Leibler divergence are provided for Bernoulli, Geometric, Poisson, Exponential, and Normal distributions.
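
A few of the closed-form expressions referred to above are standard and easy to check numerically. The sketch below (illustrative only) uses the Bernoulli, normal, and exponential formulas, with a Monte Carlo sanity check for the normal case.

```python
import numpy as np

def kl_bernoulli(p, q):
    # D(Ber(p) || Ber(q)) = p log(p/q) + (1-p) log((1-p)/(1-q))
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def kl_normal(mu1, s1, mu2, s2):
    # D(N(mu1, s1^2) || N(mu2, s2^2))
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

def kl_exponential(l1, l2):
    # D(Exp(l1) || Exp(l2)) with rate parameters l1, l2
    return np.log(l1 / l2) + l2 / l1 - 1

# Monte Carlo sanity check of the normal formula: KL = E_P[log p(X) - log q(X)]
rng = np.random.default_rng(0)
mu1, s1, mu2, s2 = 0.0, 1.0, 1.0, 2.0
x = rng.normal(mu1, s1, 500_000)
log_ratio = (-0.5 * ((x - mu1) / s1)**2 - np.log(s1)) - (-0.5 * ((x - mu2) / s2)**2 - np.log(s2))
print(kl_normal(mu1, s1, mu2, s2), log_ratio.mean())   # the two values nearly agree
```
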
This chapter contains sections titled: Method of Types, Law of Large Numbers, Universal Source Coding, Large Deviation Theory, Examples of Sanov's Theorem, Conditional Limit Theorem, Hypothesis Testing, Chernoff–Stein Lemma, Chernoff Information, Fisher Information and the Cramér–Rao Inequality, Summary, Problems, Historical Notes.
We introduce an axiomatic approach for channel divergences and channel relative entropies that is based on three information-theoretic axioms of monotonicity under superchannels, i.e., generalized data processing inequality, additivity under tensor products, and normalization, similar to the approach given for the state domain in Gour and Tomamichel [arXiv:2006.11164v1 (2020), arXiv:2006.12408v2 (2020)]. We show that these axioms are sufficient to give enough structure in the channel domain as well, leading to numerous properties that are applicable to all channel divergences. These include faithfulness, continuity, a type of triangle inequality, and boundedness between the min and max channel relative entropies. In addition, we prove a uniqueness theorem showing that the Kullback-Leibler divergence has only one extension to classical channels. For quantum channels, with the exception of the max relative entropy, this uniqueness does not hold. Instead, we prove the optimality of the amortized channel extension of the Umegaki relative entropy, by showing that it provides a lower bound on all channel relative entropies that reduce to the Kullback-Leibler divergence on classical states. We also introduce the maximal channel extension of a given classical state divergence and study its properties. (PRX Quantum, DOI: https://doi.org/10.1103/PRXQuantum.2.010313)
The problem of quantifying the distance between distributions arises in various fields, including cryptography, information theory, communication networks, machine learning, and data mining. In this article, in analogy with the cumulative Jensen–Shannon divergence defined in Nguyen and Vreeken (2015), we propose a new divergence measure based on the cumulative distribution function and call it the cumulative α-Jensen–Shannon divergence, denoted by CJS(α). Properties of CJS(α) are studied in detail, and two upper bounds for CJS(α) are obtained. The simplified results under the proportional reversed hazard rate model are given. Various illustrative examples are analyzed.
This work presents an upper bound on the value that the Kullback-Leibler (KL) divergence can reach for a class of probability distributions called quantum distributions (QD). The aim is to find a distribution $U$ which maximizes the KL divergence from a given distribution $P$ under the assumption that $P$ and $U$ have been generated by distributing a given discrete quantity, a quantum. Quantum distributions naturally represent a wide range of probability distributions that are used in practical applications. Moreover, such a class of distributions can be obtained as an approximation of any probability distribution. The retrieval of an upper bound for the entropic divergence is here shown to be possible under the condition that the compared distributions are quantum distributions over the same quantum value, which makes them comparable. Thus, entropic divergence acquires a more powerful meaning when it is applied to comparable distributions. This aspect should be taken into account in future developments of divergences. The theoretical findings are used for proposing a notion of normalized KL divergence that is empirically shown to behave differently from already known measures.
In the problem of binary quantum channel discrimination with product inputs, the supremum of all type II error exponents for which the optimal type I errors go to zero is equal to the Umegaki channel relative entropy, while the infimum of all type II error exponents for which the optimal type I errors go to one is equal to the infimum of the sandwiched channel Rényi $\alpha$-divergences over all $\alpha > 1$. We prove the equality of these two threshold values (and therefore the strong converse property for this problem) using a minimax argument based on a newly established continuity property of the sandwiched Rényi divergences. Motivated by this, we give a detailed analysis of the continuity properties of various other quantum (channel) Rényi divergences, which may be of independent interest.
The scaled Rényi information plays a significant role in evaluating the performance of information processing tasks by virtue of its connection to the error exponent analysis. In quantum information theory, there are three generalizations of the classical Rényi divergence: the Petz, sandwiched, and log-Euclidean versions, which possess meaningful operational interpretations. The goal of this paper is thus to analyze fundamental properties of scaled Rényi information from a noncommutative measure-theoretic perspective. Firstly, we prove the uniform equicontinuity for all three quantum versions of Rényi information, which yields the joint continuity of these quantities in the orders and priors. Secondly, we establish the concavity in the region of s ∈ (-1, 0) for both the Petz and the sandwiched versions. This settles the open questions raised by Holevo, Mosonyi and Ogawa. For the applications, we show that the strong converse exponent in classical-quantum channel coding satisfies a minimax identity. The established concavity is further employed to prove an entropic duality between classical data compression with quantum side information and classical-quantum channel coding, and a Fenchel duality in joint source-channel coding with quantum side information.
This paper starts by considering the minimization of the Rényi divergence subject to a constraint on the total variation distance. Based on the solution of this optimization problem, the exact locus of the points (D(Q||P1), D(Q||P2)) is determined when P1, P2, and Q are arbitrary probability measures which are mutually absolutely continuous, and the total variation distance between P1 and P2 is not below a given value. It is further shown that all the points of this convex region are attained by probability measures which are defined on a binary alphabet. This characterization yields a geometric interpretation of the minimal Chernoff information subject to a constraint on the variational distance. This paper also derives an exponential upper bound on the performance of binary linear block codes (or code ensembles) under maximum-likelihood decoding. Its derivation relies on the Gallager bounding technique, and it reproduces the Shulman-Feder bound as a special case. The bound is expressed in terms of the Rényi divergence from the normalized distance spectrum of the code (or the average distance spectrum of the ensemble) to the binomially distributed distance spectrum of the capacity-achieving ensemble of random block codes. This exponential bound provides a quantitative measure of the degradation in performance of binary linear block codes (or code ensembles) as a function of the deviation of their distance spectra from the binomial distribution. An efficient use of this bound is considered.
The Kullback-Leibler divergence (KL divergence) is a statistical measure that quantifies the difference between two probability distributions. Specifically, it assesses the amount of information that is lost when one distribution is used to approximate another. This concept is crucial in various fields, including information theory, statistics, and machine learning, as it helps in understanding how well a model represents the underlying data. In a recent study by Nawa and Nadarajah, a comprehensive collection of exact expressions for the Kullback-Leibler divergence was derived for both multivariate and matrix-variate distributions. This work is significant as it expands on our existing knowledge of KL divergence by providing precise formulations for over sixty univariate distributions. The authors also ensured the accuracy of these expressions through numerical checks, which adds a layer of validation to their findings. The derived expressions incorporate various special functions, highlighting the mathematical complexity and richness of the topic. This research contributes to a deeper understanding of KL divergence and its applications in statistical analysis and modeling.
Entropy is an important concept in many fields related to communications [...].
Information theory is a mathematical theory of learning with deep connections with topics as diverse as artificial intelligence, statistical physics, and biological evolution. Many primers on information theory paint a broad picture with relatively little mathematical sophistication, while many others develop specific application areas in detail. In contrast, these informal notes aim to outline some elements of the information-theoretic "way of thinking," by cutting a rapid and interesting path through some of the theory's foundational concepts and results. They are aimed at practicing systems scientists who are interested in exploring potential connections between information theory and their own fields. The main mathematical prerequisite for the notes is comfort with elementary probability, including sample spaces, conditioning, and expectations. We take the Kullback-Leibler divergence as our most basic concept, and then proceed to develop the entropy and mutual information. We discuss some of the main results, including the Chernoff bounds as a characterization of the divergence; Gibbs' Theorem; and the Data Processing Inequality. A recurring theme is that the definitions of information theory support natural theorems that sound "obvious" when translated into English. More pithily, "information theory makes common sense precise." Since the focus of the notes is not primarily on technical details, proofs are provided only where the relevant techniques are illustrative of broader themes. Otherwise, proofs and intriguing tangents are referenced in liberally-sprinkled footnotes. The notes close with a highly nonexhaustive list of references to resources and other perspectives on the field.
When two input discrete distributions pass through a memoryless channel, the data processing inequality expresses the fact that the f-divergence decreases. As is well known, in the case of the Kullback-Leibler divergence, the chain rule makes it possible to obtain an exact expression for the gap in the data processing inequality, as a conditional divergence. In this paper, we provide a general expression for the gap valid for any convex function f, which generalizes the well-known formula for the Kullback-Leibler case. The result is a sum of Bregman divergences, which shows non-negativity. We then show how this identity may be used to give simple proofs of two classical results in linear algebra.
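
The Kullback-Leibler special case mentioned above can be verified numerically: when the same channel acts on two inputs, the chain rule gives D(P_X||Q_X) − D(P_Y||Q_Y) = Σ_y P_Y(y) D(P_{X|Y=y}||Q_{X|Y=y}), so the data processing gap is a conditional divergence. The short Python sketch below illustrates only this KL identity, not the paper's general f-divergence/Bregman expression.

```python
import numpy as np

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

rng = np.random.default_rng(1)
nx, ny = 4, 3
W = rng.random((nx, ny)); W /= W.sum(axis=1, keepdims=True)   # channel W(y|x)
p = rng.random(nx); p /= p.sum()                               # input P_X
q = rng.random(nx); q /= q.sum()                               # input Q_X

py, qy = p @ W, q @ W                      # output distributions P_Y, Q_Y
post_p = (p[:, None] * W) / py             # posteriors P(x|y), columns sum to 1
post_q = (q[:, None] * W) / qy             # posteriors Q(x|y)

gap = sum(py[y] * kl(post_p[:, y], post_q[:, y]) for y in range(ny))
print(kl(p, q) - kl(py, qy), gap)          # the two numbers coincide (chain rule)
```
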
The Kullback–Leibler (KL) divergence is a widely used measure for comparing probability distributions, but it faces limitations such as its unbounded nature and the lack of comparability between distributions with different quantum values (the discrete unit of probability). This study addresses these challenges by introducing the concept of quantized distributions, which are probability distributions formed by distributing a given discrete quantity or quantum. This study establishes an upper bound for the KL divergence between two quantized distributions, enabling the development of a normalized KL divergence that ranges between 0 and 1. The theoretical findings are supported by empirical evaluations, demonstrating the distinct behavior of the normalized KL divergence compared to other commonly used measures. The results highlight the importance of considering the quantum value when applying the KL divergence, offering insights for future advancements in divergence measures.
This paper introduces a new dissimilarity measure between two discrete and finite probability distributions. The followed approach is grounded jointly on mixtures of probability distributions and an optimization procedure. We discuss the clear interpretation of the constitutive elements of the measure under an information-theoretical perspective by also highlighting its connections with the Rényi divergence of infinite order. Moreover, we show how the measure describes the inefficiency in assuming that a given probability distribution coincides with a benchmark one by giving formal writing of the random interference between the considered probability distributions. We explore the properties of the considered tool, which are in line with those defining the concept of quasimetric, i.e. a divergence for which the triangular inequality is satisfied. As a possible usage of the introduced device, an application to rare events is illustrated. This application shows that our measure may be suitable in cases where the accuracy of the small probabilities is a relevant matter.
Mathematical models in epidemiology are an indispensable tool to determine the dynamics and important characteristics of infectious diseases. Apart from their scientific merit, these models are often used to inform political decisions and intervention measures during an ongoing outbreak. However, reliably inferring the dynamics of ongoing outbreaks by connecting complex models to real data is still hard and requires either laborious manual parameter fitting or expensive optimization methods which have to be repeated from scratch for every application of a given model. In this work, we address this problem with a novel combination of epidemiological modeling with specialized neural networks. Our approach entails two computational phases: In an initial training phase, a mathematical model describing the epidemic is used as a coach for a neural network, which acquires global knowledge about the full range of possible disease dynamics. In the subsequent inference phase, the trained neural network processes the observed data of an actual outbreak and infers the parameters of the model in order to realistically reproduce the observed dynamics and reliably predict future progression. With its flexible framework, our simulation-based approach is applicable to a variety of epidemiological models. Moreover, since our method is fully Bayesian, it is designed to incorporate all available prior knowledge about plausible parameter values and returns complete joint posterior distributions over these parameters. Application of our method to the early Covid-19 outbreak phase in Germany demonstrates that we are able to obtain reliable probabilistic estimates for important disease characteristics, such as generation time, fraction of undetected infections, likelihood of transmission before symptom onset, and reporting delays using a very moderate amount of real-world observations.
We derive a variational formula for the optimal growth rate of reward in the infinite horizon risk-sensitive control problem for discrete time Markov decision processes with compact metric state and action spaces, extending a formula of Donsker and Varadhan for the Perron–Frobenius eigenvalue of a positive operator. This leads to a concave maximization formulation of the problem of determining this optimal growth rate.
Mixture models are widely used in Bayesian statistics and machine learning, in particular in computational biology, natural language processing and many other fields. Variational inference, a technique for approximating intractable posteriors thanks to optimization algorithms, is extremely popular in practice when dealing with complex models such as mixtures. The contribution of this paper is two-fold. First, we study the concentration of variational approximations of posteriors, which is still an open problem for general mixtures, and we derive consistency and rates of convergence. We also tackle the problem of model selection for the number of components: we study the approach already used in practice, which consists in maximizing a numerical criterion (the Evidence Lower Bound). We prove that this strategy indeed leads to strong oracle inequalities. We illustrate our theoretical results by applications to Gaussian and multinomial mixtures.
Atar, Chowdhary and Dupuis have recently exhibited a variational formula for exponential integrals of bounded measurable functions in terms of Rényi divergences. We show that a variational characterization of the Rényi divergences between two probability distributions on a measurable space in terms of relative entropies, when combined with the elementary variational formula for exponential integrals of bounded measurable functions in terms of relative entropy, yields the variational formula of Atar, Chowdhary and Dupuis as a corollary. We then develop an analogous variational characterization of the Rényi divergence rates between two stationary finite state Markov chains in terms of relative entropy rates. When combined with Varadhan's variational characterization of the spectral radius of square matrices with nonnegative entries in terms of relative entropy, this yields an analog of the variational formula of Atar, Chowdhary and Dupuis in the framework of stationary finite state Markov chains.
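
The "elementary variational formula" invoked above is the Gibbs/Donsker-Varadhan identity log E_Q[e^g] = sup_P {E_P[g] − D(P||Q)}, attained by the tilted distribution P* ∝ Q e^g. The finite-alphabet Python sketch below checks this identity; it is illustrative only and does not reproduce the paper's Markov-chain results.

```python
import numpy as np

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

rng = np.random.default_rng(2)
n = 5
q = rng.random(n); q /= q.sum()          # reference distribution Q
g = rng.normal(size=n)                    # bounded function g

lhs = np.log(np.sum(q * np.exp(g)))       # log E_Q[exp(g)]

p_star = q * np.exp(g); p_star /= p_star.sum()   # tilted distribution attains the sup
print(lhs, np.dot(p_star, g) - kl(p_star, q))    # the two values are equal

# any other P gives a smaller value of E_P[g] - D(P||Q)
for _ in range(3):
    p = rng.random(n); p /= p.sum()
    assert np.dot(p, g) - kl(p, q) <= lhs + 1e-12
```
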
The divergence minimization problem plays an important role in various fields. In this note, we focus on differentiable and strictly convex divergences. For some minimization problems, we show the minimizer conditions and the uniqueness of the minimizer without assuming a specific form of divergences. Furthermore, we show geometric properties related to the minimization problems.
Rare events play a key role in many applications and numerous algorithms have been proposed for estimating the probability of a rare event. However, relatively little is known on how to quantify the sensitivity of the rare event's probability with respect to model parameters. In this paper, instead of the direct statistical estimation of rare event sensitivities, we develop novel and general uncertainty quantification and sensitivity bounds which are not tied to specific rare event simulation methods and which apply to families of rare events. Our method is based on a recently derived variational representation for the family of Rényi divergences in terms of risk sensitive functionals associated with the rare events under consideration. Inspired by the derived bounds, we propose new sensitivity indices for rare events and relate them to the moment generating function of the score function. The bounds scale in such a way that we additionally develop sensitivity indices for large deviation rate functions.
The Kullback-Leibler (KL) divergence is a discrepancy measure between probability distributions that plays a central role in information theory, statistics and machine learning. While there are numerous methods for estimating this quantity from data, a limit distribution theory which quantifies fluctuations of the estimation error is largely obscure. In this paper, we close this gap by identifying sufficient conditions on the population distributions for the existence of distributional limits and characterizing the limiting variables. These results are used to derive one- and two-sample limit theorems for Gaussian-smoothed KL divergence, both under the null and the alternative. Finally, an application of the limit distribution result to auditing differential privacy is proposed and analyzed for significance level and power against local alternatives.
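
As background, the simplest way to estimate D(P||Q) on a finite alphabet is a plug-in estimator built from empirical frequencies. The sketch below is a generic illustration of that baseline, with hypothetical add-one smoothing on Q; it is not the Gaussian-smoothed estimator analyzed in the paper.

```python
import numpy as np

def plug_in_kl(x_samples, y_samples, alphabet_size):
    """Naive plug-in estimate of D(P||Q) from i.i.d. samples on {0, ..., k-1}.
    Add-one smoothing on Q avoids division by zero; this introduces bias."""
    p_hat = np.bincount(x_samples, minlength=alphabet_size) / len(x_samples)
    q_hat = (np.bincount(y_samples, minlength=alphabet_size) + 1.0) / (len(y_samples) + alphabet_size)
    mask = p_hat > 0
    return np.sum(p_hat[mask] * np.log(p_hat[mask] / q_hat[mask]))

rng = np.random.default_rng(3)
k = 6
p = rng.random(k); p /= p.sum()
q = rng.random(k); q /= q.sum()
true_kl = np.sum(p * np.log(p / q))
est = plug_in_kl(rng.choice(k, 100_000, p=p), rng.choice(k, 100_000, p=q), k)
print(true_kl, est)   # the estimate fluctuates around the true value
```
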
We discuss unbiased estimating equations in a class of objective functions using a monotonically increasing function f and Bregman divergence. The choice of the function f gives desirable properties, such as robustness against outliers. To obtain unbiased estimating equations, analytically intractable integrals are generally required as bias correction terms. In this study, we clarify the combination of Bregman divergence, statistical model, and function f in which the bias correction term vanishes. Focusing on Mahalanobis and Itakura–Saito distances, we generalize fundamental existing results and characterize a class of distributions of positive reals with a scale parameter, including the gamma distribution as a special case. We also generalize these results to general model classes characterized by one-dimensional Bregman divergence. Furthermore, we discuss the possibility of latent bias minimization when the proportion of outliers is large, which is induced by the extinction of the bias correction term. We conducted numerical experiments to show that the latent bias can approach zero under heavy contamination of outliers or very small inliers.
Fusing probabilistic information is a fundamental task in signal and data processing with relevance to many fields of technology and science. In this work, we investigate the fusion of multiple probability density functions (pdfs) of a continuous random variable or vector. Although the case of continuous random variables and the problem of pdf fusion frequently arise in multisensor signal processing, statistical inference, and machine learning, a universally accepted method for pdf fusion does not exist. The diversity of approaches, perspectives, and solutions related to pdf fusion motivates a unified presentation of the theory and methodology of the field. We discuss three different approaches to fusing pdfs. In the axiomatic approach, the fusion rule is defined indirectly by a set of properties (axioms). In the optimization approach, it is the result of minimizing an objective function that involves an information-theoretic divergence or a distance measure. In the supra-Bayesian approach, the fusion center interprets the pdfs to be fused as random observations. Our work is partly a survey, reviewing in a structured and coherent fashion many of the concepts and methods that have been developed in the literature. In addition, we present new results for each of the three approaches. Our original contributions include new fusion rules, axioms, and axiomatic and optimization-based characterizations; a new formulation of supra-Bayesian fusion in terms of finite-dimensional parametrizations; and a study of supra-Bayesian fusion of posterior pdfs for linear Gaussian models.
The problem of estimating the Kullback-Leibler divergence D(P∥Q) between two unknown distributions P and Q is studied, under the assumption that the alphabet size k of the distributions can scale to infinity. The estimation is based on m independent samples drawn from P and n independent samples drawn from Q. It is first shown that there does not exist any consistent estimator that guarantees asymptotically small worst-case quadratic risk over the set of all pairs of distributions. A restricted set that contains pairs of distributions with density ratio bounded by a function f(k) is further considered. An augmented plug-in estimator is proposed, and its worst-case quadratic risk is shown to be within a constant factor of ((k/m) + (k f(k)/n))^2 + (log^2 f(k)/m) + (f(k)/n), if m and n exceed a constant factor of k and k f(k), respectively. Moreover, the minimax quadratic risk is characterized to be within a constant factor of ((k/(m log k)) + (k f(k)/(n log k)))^2 + (log^2 f(k)/m) + (f(k)/n), if m and n exceed a constant factor of k/log k and k f(k)/log k, respectively. The lower bound on the minimax quadratic risk is characterized by employing a generalized Le Cam's method. A minimax optimal estimator is then constructed by employing both the polynomial approximation and the plug-in approaches.
We introduce a new family via the log mean of an underlying distribution and as baseline the proportional hazards model and derive some important properties. A special model is proposed by taking the Weibull for the baseline. We derive several properties of the sub-model such as moments, order statistics, hazard function, survival regression and certain characterization results. We estimate the parameters using frequentist and Bayesian approaches. Further, Bayes estimators, posterior risks, credible intervals and highest posterior density intervals are obtained under different symmetric and asymmetric loss functions. A Monte Carlo simulation study examines the biases and mean square errors of the maximum likelihood estimators. For illustrative purposes, we consider heart transplant and bladder cancer data sets and investigate the efficiency of the proposed model.
Fluid dynamical simulations are often performed using cheap macroscopic models like the Euler equations. For rarefied gases under near-equilibrium conditions, however, macroscopic models are not sufficiently accurate and a simulation using more accurate microscopic models is often expensive. In this paper, we introduce a hierarchical micro-macro acceleration based on moment models that combines the speed of macroscopic models and the accuracy of microscopic models. The hierarchical micro-macro acceleration is based on a flexible four step procedure including a micro step, restriction step, macro step, and matching step. We derive several new micro-macro methods from that and compare to existing methods. In 1D and 2D test cases, the new methods achieve high accuracy and a large speedup.
In Gaussian graphical models, the zero entries in the precision matrix determine the dependence structure, so estimating that sparse precision matrix and, thereby, learning this underlying structure, is an important and challenging problem. We propose an empirical version of the $G$-Wishart prior for sparse precision matrices, where the prior mode is informed by the data in a suitable way. Paired with a prior on the graph structure, a marginal posterior distribution for the same is obtained that takes the form of a ratio of two $G$-Wishart normalizing constants. We show that this ratio can be easily and accurately computed using a Laplace approximation, which leads to fast and efficient posterior sampling even in high-dimensions. Numerical results demonstrate the proposed method's superior performance, in terms of speed and accuracy, across a variety of settings, and theoretical support is provided in the form of a posterior concentration rate theorem.
The conventional channel resolvability refers to the minimum rate needed for an input process to approximate the channel output distribution in total variation distance. In this paper, we study E_γ-resolvability, in which total variation is replaced by the more general E_γ distance. A general one-shot achievability bound for the precision of such an approximation is developed. Let Q_{X|U} be a random transformation, n be an integer, and E ∈ (0, +∞). We show that in the asymptotic setting where γ = exp(nE), a (nonnegative) randomness rate above inf_{Q_U : D(Q_X||π_X) ≤ E} {D(Q_X||π_X) + I(Q_U, Q_{X|U}) − E} is sufficient to approximate the output distribution π_X^{⊗n} using the channel Q_{X|U}^{⊗n}, where Q_U → Q_{X|U} → Q_X, and is also necessary in the case of finite U and X. In particular, a randomness rate of inf_{Q_U} I(Q_U, Q_{X|U}) − E is always sufficient. We also study the convergence of the approximation error under the high-probability criteria in the case of random codebooks. Moreover, by developing simple bounds relating E_γ and other distance measures, we are able to determine the exact linear growth rate of the approximation errors measured in relative entropy and smooth Rényi divergences for a fixed-input randomness rate.
The new resolvability result is then used to derive: 1) a one-shot upper bound on the probability of excess distortion in lossy compression, which is exponentially tight in the i.i.d. setting; 2) a one-shot version of the mutual covering lemma; and 3) a lower bound on the size of the eavesdropper list to include the actual message and a lower bound on the eavesdropper false-alarm probability in the wiretap channel problem, which is (asymptotically) ensemble-tight.
Motivated by the method of interpolating inequalities that makes use of the improved Jensen-type inequalities, in this paper we integrate this approach with the well known Zipf–Mandelbrot law applied to various types of f-divergences and distances, such as the Kullback–Leibler divergence, Hellinger distance, Bhattacharyya distance (via its coefficient), $\chi^{2}$-divergence, total variation distance and triangular discrimination. Addressing these applications, we firstly deduce general results of this type for the Csiszár divergence functional, from which the listed divergences originate. When presenting the analyzed inequalities for the Zipf–Mandelbrot law, we accentuate its special form, the Zipf law, with its specific role in linguistics. We introduce this aspect through the Zipfian word distribution associated to the English and Russian languages, using the obtained bounds for the Kullback–Leibler divergence.
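
To make the setting concrete: the Zipf–Mandelbrot law assigns probabilities proportional to 1/(i + q)^s to ranks i = 1, …, N, reducing to the Zipf law when q = 0. The sketch below computes the Kullback–Leibler divergence between two such rank distributions; the parameter values are hypothetical and are not the ones fitted to English and Russian in the paper.

```python
import numpy as np

def zipf_mandelbrot(N, s, q=0.0):
    """Zipf-Mandelbrot probabilities p_i proportional to 1/(i + q)^s over ranks
    i = 1..N (q = 0 gives the plain Zipf law)."""
    w = 1.0 / (np.arange(1, N + 1) + q) ** s
    return w / w.sum()

def kl(p, q_):
    return np.sum(p * np.log(p / q_))

# two hypothetical word-rank distributions over the same vocabulary size
p_a = zipf_mandelbrot(10_000, s=1.0, q=2.7)
p_b = zipf_mandelbrot(10_000, s=1.1, q=2.7)
print(kl(p_a, p_b), kl(p_b, p_a))   # the KL divergence is asymmetric
```
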
Multiplex imaging platforms have become popular for studying complex single-cell biology in the tumor microenvironment (TME) of cancer subjects. Studying the intensity of the proteins that regulate important cell-functions becomes extremely crucial for subject-specific assessment of risks. The conventional approach requires selection of two thresholds, one to define the cells of the TME as positive or negative for a particular protein, and the other to classify the subjects based on the proportion of the positive cells. We present a threshold-free approach in which distance between a pair of subjects is computed based on the probability density of the protein in their TMEs. The distance matrix can either be used to classify the subjects into meaningful groups or can directly be used in a kernel machine regression framework for testing association with clinical outcomes. The method gets rid of the subjectivity bias of the thresholding-based approach, enabling easier but interpretable analysis. We analyze a lung cancer dataset, finding the difference in the density of protein HLA-DR to be significantly associated with the overall survival, and a triple-negative breast cancer dataset, analyzing the effects of multiple proteins on survival and recurrence. The reliability of our method is demonstrated through extensive simulation studies. The associated R package can be found at https://github.com/sealx017/DenVar. Supplementary data are available at Bioinformatics Advances online.
In this paper, we consider a recently-proposed model of teaching and learning under uncertainty, in which a teacher receives independent observations of a single bit corrupted by binary symmetric noise, and sequentially transmits to a student through another binary symmetric channel based on the bits observed so far. After a given number $n$ of transmissions, the student outputs an estimate of the unknown bit, and we are interested in the exponential decay rate of the error probability as $n$ increases. We propose a novel block-structured teaching strategy in which the teacher encodes the number of 1s received in each block, and show that the resulting error exponent is the binary relative entropy $D\big(\frac{1}{2}\|\max(p,q)\big)$, where $p$ and $q$ are the noise parameters. This matches a trivial converse result based on the data processing inequality, and settles two conjectures of [Jog and Loh, 2021] and [Huleihel et al., 2019]. In addition, we show that the computation time required by the teacher and student is linear in $n$. We also study a more general setting in which the binary symmetric channels are replaced by general binary-input discrete memoryless channels. We provide an achievability bound and a converse bound, and show that the two coincide in certain cases, including (i) when the two channels are identical, and (ii) when the student-teacher channel is a binary symmetric channel. More generally, we give sufficient conditions under which our achievable learning rate is the best possible for block-structured protocols.
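
The error exponent stated above is easy to evaluate: the binary relative entropy is D(a||b) = a log(a/b) + (1 − a) log((1 − a)/(1 − b)). A tiny sketch follows, with noise parameters chosen arbitrarily for illustration.

```python
import numpy as np

def binary_kl(a, b):
    """Binary relative entropy D(a||b) between Bernoulli(a) and Bernoulli(b)."""
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

def teaching_exponent(p, q):
    # Error exponent D(1/2 || max(p, q)) reported for the block-structured strategy,
    # with p, q in (0, 1/2) the two binary symmetric noise parameters.
    return binary_kl(0.5, max(p, q))

print(teaching_exponent(0.1, 0.2))   # the exponent is governed by the noisier channel
```
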
It is common in phylogenetics to have some, perhaps partial, information about the overall evolutionary tree of a group of organisms and wish to find an evolutionary tree of a specific gene for those organisms. There may not be enough information in the gene sequences alone to accurately reconstruct the correct "gene tree." Although the gene tree may deviate from the "species tree" due to a variety of genetic processes, in the absence of evidence to the contrary it is parsimonious to assume that they agree. A common statistical approach in these situations is to develop a likelihood penalty to incorporate such additional information. Recent studies using simulation and empirical data suggest that a likelihood penalty quantifying concordance with a species tree can significantly improve the accuracy of gene tree reconstruction compared to using sequence data alone. However, the consistency of such an approach has not yet been established, nor have convergence rates been bounded. Because phylogenetics is a non-standard inference problem, the standard theory does not apply. In this paper, we propose a penalized maximum likelihood estimator for gene tree reconstruction, where the penalty is the square of the Billera-Holmes-Vogtmann geodesic distance from the gene tree to the species tree. We prove that this method is consistent, and derive its convergence rate for estimating the discrete gene tree structure and continuous edge lengths (representing the amount of evolution that has occurred on that branch) simultaneously. We find that the regularized estimator is "adaptive fast converging," meaning that it can reconstruct all edges of length greater than any given threshold from gene sequences of polynomial length. Our method does not require the species tree to be known exactly; in fact, our asymptotic theory holds for any such guide tree.
Reliable Machine Learning via Structured Distributionally Robust Optimization. Data sets used to train machine learning (ML) models often suffer from sampling biases and underrepresent marginalized groups. Standard machine learning models are trained to optimize average performance and perform poorly on tail subpopulations. In “Distributionally Robust Losses for Latent Covariate Mixtures,” John Duchi, Tatsunori Hashimoto, and Hongseok Namkoong formulate a DRO approach for training ML models to perform uniformly well over subpopulations. They design a worst case optimization procedure over structured distribution shifts salient in predictive applications: shifts in (a subset of) covariates. The authors propose a convex procedure that controls worst case subpopulation performance and provide finite-sample (nonparametric) convergence guarantees. Empirically, they demonstrate their worst case procedure on lexical similarity, wine quality, and recidivism prediction tasks and observe significantly improved performance across unseen subpopulations.
In this paper, we provide new insights on the Unadjusted Langevin Algorithm. We show that this method can be formulated as a first order optimization algorithm of an objective functional defined on the Wasserstein space of order $2$. Using this interpretation and techniques borrowed from convex optimization, we give a non-asymptotic analysis of this method to sample from logconcave smooth target distribution on $\mathbb{R}^d$. Based on this interpretation, we propose two new methods for sampling from a non-smooth target distribution, which we analyze as well. Besides, these new algorithms are natural extensions of the Stochastic Gradient Langevin Dynamics (SGLD) algorithm, which is a popular extension of the Unadjusted Langevin Algorithm. Similar to SGLD, they only rely on approximations of the gradient of the target log density and can be used for large-scale Bayesian inference.
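
For reference, the basic Unadjusted Langevin Algorithm iterates x_{k+1} = x_k − γ∇U(x_k) + sqrt(2γ) ξ_k with ξ_k ~ N(0, I), approximately targeting π ∝ exp(−U) for small step sizes. A minimal sketch of this standard update follows; the paper's non-smooth and stochastic-gradient variants are not implemented here.

```python
import numpy as np

def ula(grad_U, x0, step, n_iters, rng):
    """Unadjusted Langevin Algorithm: x <- x - step * grad_U(x) + sqrt(2*step) * N(0, I).
    For small step sizes the iterates approximately sample from pi ∝ exp(-U)."""
    x = np.array(x0, dtype=float)
    samples = np.empty((n_iters, x.size))
    for k in range(n_iters):
        x = x - step * grad_U(x) + np.sqrt(2 * step) * rng.normal(size=x.size)
        samples[k] = x
    return samples

# target: standard 2-D Gaussian, U(x) = ||x||^2 / 2, so grad_U(x) = x
rng = np.random.default_rng(4)
s = ula(lambda x: x, x0=[3.0, -3.0], step=0.05, n_iters=20_000, rng=rng)
print(s[5_000:].mean(axis=0), s[5_000:].var(axis=0))   # near mean 0, variance ~ 1 (small step bias)
```
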
In high-dimensional Bayesian applications, choosing a prior such that the corresponding posterior distribution has desired asymptotic concentration properties can be challenging. This paper develops a general strategy for constructing empirical or data-dependent priors whose corresponding posterior distributions have targeted, often optimal, concentration rates. In essence, the idea is to place a prior which has sufficient mass on parameter values for which the likelihood is suitably large. This makes the asymptotic properties of the posterior less sensitive to the shape of the prior which, in turn, allows users to work with priors of convenient forms while maintaining the desired posterior concentration rates. General results on both adaptive and non-adaptive rates based on empirical priors are presented, along with illustrations in density estimation, nonparametric regression, and high-dimensional structured normal models.
Modern applications of Bayesian inference involve models that are sufficiently complex that the corresponding posterior distributions are intractable and must be approximated. The most common approximation is based on Markov chain Monte Carlo, but these can be expensive when the data set is large and/or the model is complex, so more efficient variational approximations have recently received considerable attention. The traditional variational methods, which seek to minimize the Kullback-Leibler divergence between the posterior and a relatively simple parametric family, provide accurate and efficient estimation of the posterior mean, but often do not capture other moments, and have limitations in terms of the models to which they can be applied. Here we propose the construction of variational approximations based on minimizing the Fisher divergence, and develop an efficient computational algorithm that can be applied to a wide range of models without conjugacy or potentially unrealistic mean-field assumptions. We demonstrate the superior performance of the proposed method for the benchmark case of logistic regression.
Inference on high-dimensional parameters in structured linear models is an important statistical problem. This paper focuses on the case of a piecewise polynomial Gaussian sequence model, and we develop a new empirical Bayes solution that enjoys adaptive minimax posterior concentration rates and improved structure learning properties compared to existing methods. Moreover, thanks to the conjugate form of the empirical prior, posterior computations are fast and easy. Numerical examples also highlight the method's strong finite-sample performance compared to existing methods across a range of different scenarios.
In this paper, we derive sharp nonlinear dimension-free Brascamp--Lieb inequalities (including hypercontractivity inequalities) for distributions on Polish spaces, which strengthen the classic Brascamp--Lieb inequalities. Applications include the extension of Mrs. Gerber's lemma to the cases of Rényi divergences and distributions on Polish spaces, the strengthening of small-set expansion theorems, and the characterization of the exponent of the $q$-stability. Our proofs in this paper are based on information-theoretic and coupling techniques.
The increasing penetration of renewables and the constraints posed by the pan-European market make it increasingly crucial to evaluate the dynamic behaviour of the whole grid and to cope with forecast uncertainties from operational planning to the online environment. The FP7 EU project iTesla addresses these needs and encompasses several major objectives, including the definition of a platform architecture, a dynamic data structure, and dynamic model validation. The online security assessment is characterised by a multi-stage filtering process: this includes a "Monte Carlo like approach" which applies the security rules derived from extensive security analyses performed offline to a set of "new base cases" sampled around the power system (PS) forecast state, with the aim of discarding as many stable contingencies as possible. The paper focuses on the management of historical data (related to stochastic renewable and load snapshots and forecasts) in order to solve some intrinsic criticalities of raw data and to derive a reliable model of the multivariate distributions of renewables and loads conditioned to the specific forecast state of the grid, with the final aim of generating the "uncertainty region" of states around the forecast state.
The following problem is considered: given a joint distribution $P_{XY}$ and an event $E$, bound $P_{XY}(E)$ in terms of $P_X P_Y(E)$ (where $P_X P_Y$ is the product of the marginals of $P_{XY}$) and a measure of dependence of $X$ and $Y$. Such bounds have direct applications in the analysis of the generalization error of learning algorithms, where $E$ represents a large error event and the measure of dependence controls the degree of overfitting. Herein, bounds are demonstrated using several information-theoretic metrics, in particular: mutual information, lautum information, maximal leakage, and $J_\infty$. The mutual information bound can outperform comparable bounds in the literature by an arbitrarily large factor.
We explore properties of the $\chi^{2}$ and Rényi distances to the normal law and in particular propose necessary and sufficient conditions under which these distances tend to zero in the central limit theorem (with exact rates with respect to the increasing number of summands).
In this work, we propose two information generating functions: general weighted information and relative information generating functions, and study their properties. It is shown that the general weighted information generating function (GWIGF) is shift-dependent and can be expressed in terms of the weighted Shannon entropy. The GWIGF of a transformed random variable has been obtained in terms of the GWIGF of a known distribution. Several bounds of the GWIGF have been proposed. We have obtained sufficient conditions under which the GWIGFs of two distributions are comparable. Further, we have established a connection between the weighted varentropy and varentropy with the proposed GWIGF. An upper bound for the GWIGF of the sum of two independent random variables is derived. The effect of the general weighted relative information generating function (GWRIGF) for two transformed random variables under strictly monotone functions has been studied. Further, these information generating functions are studied for escort, generalized escort and mixture distributions. In particular, we propose the weighted β-cross informational energy and establish a close connection with the GWIGF for escort distributions. The residual versions of the newly proposed generating functions are considered and several similar properties have been explored. A non-parametric estimator of the residual general weighted information generating function is proposed. A simulated data set and two real data sets are considered for the purpose of illustration. Finally, we have compared the non-parametric approach with a parametric approach in terms of the absolute bias and mean squared error values.
In this note, we present an information diffusion inequality derived from an elementary argument, which gives rise to a very general Fano-type inequality. The latter unifies and generalizes the distance-based Fano inequality and the continuous Fano inequality established in [Corollary 1, Propositions 1 and 2, arXiv:1311.2669v2], as well as the generalized Fano inequality in [Equation following (10); T. S. Han and S. Verdú. Generalizing the Fano inequality. IEEE Transactions on Information Theory, 40(4):1247-1251, July 1994].
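For orientation, the classical special case being generalized here is Fano's inequality for estimating a discrete $X$ from $Y$,
$$ H(X\mid Y) \le h_b(P_e) + P_e \log\big(|\mathcal{X}|-1\big), \qquad P_e = \Pr[\hat X(Y)\neq X], $$
where $h_b$ is the binary entropy function; when $X$ is uniform this gives the familiar bound $P_e \ge 1 - \dfrac{I(X;Y)+\log 2}{\log|\mathcal{X}|}$.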
The Kullback-Leibler divergence rate of two finite or countable ergodic Markov chains is well-known to exist and have an explicit expression as a function of the transition matrices of the chains, allowing access to classical tools for applications, such as minimization under constraints or projections on convex sets. The existence of Rényi divergence rates of ergodic Markov chains has been established in [5], [12]; here we establish explicit expressions for them and prove some properties of the resulting measures of discrepancy between stochastic matrices, opening the way to applications.
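As a point of reference for the Kullback-Leibler case mentioned above, the divergence rate between two ergodic chains with transition matrices $P$ and $Q$, where $\pi$ is the stationary distribution of $P$, is $\sum_i \pi_i \sum_j P_{ij}\log(P_{ij}/Q_{ij})$; a minimal sketch (the matrices below are illustrative) is:

```python
import numpy as np

def stationary_distribution(P):
    """Left eigenvector of P for eigenvalue 1, normalized to a probability vector."""
    w, v = np.linalg.eig(P.T)
    pi = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    return pi / pi.sum()

def kl_divergence_rate(P, Q):
    """KL divergence rate sum_i pi_i sum_j P_ij log(P_ij / Q_ij), with the convention 0 log 0 = 0."""
    pi = stationary_distribution(P)
    mask = P > 0
    terms = np.where(mask, P * np.log(np.where(mask, P, 1.0) / Q), 0.0)
    return float(np.sum(pi[:, None] * terms))

P = np.array([[0.9, 0.1], [0.2, 0.8]])
Q = np.array([[0.7, 0.3], [0.4, 0.6]])
print(kl_divergence_rate(P, Q))
```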
The problem of robust hypothesis testing is studied, where under the null and the alternative hypotheses, the data-generating distributions are assumed to be in some uncertainty sets, and the goal is to design a test that performs well under the worst-case distributions over the uncertainty sets. In this paper, uncertainty sets are constructed in a data-driven manner using a kernel method, i.e., they are centered around empirical distributions of training samples from the null and alternative hypotheses, respectively, and are constrained via the distance between kernel mean embeddings of distributions in the reproducing kernel Hilbert space, i.e., the maximum mean discrepancy (MMD). The Bayesian setting and the Neyman-Pearson setting are investigated. For the Bayesian setting, where the goal is to minimize the worst-case error probability, an optimal test is first obtained when the alphabet is finite. When the alphabet is infinite, a tractable approximation is proposed to quantify the worst-case average error probability, and a kernel smoothing method is further applied to design a test that generalizes to unseen samples. A direct robust kernel test is also proposed and proved to be exponentially consistent. For the Neyman-Pearson setting, where the goal is to minimize the worst-case probability of miss detection subject to a constraint on the worst-case probability of false alarm, an efficient robust kernel test is proposed and is shown to be asymptotically optimal. Numerical results are provided to demonstrate the performance of the proposed robust tests.
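For readers unfamiliar with the distance used to build the uncertainty sets, here is a minimal sketch (the Gaussian kernel and its bandwidth are illustrative choices, not the paper's) of the biased V-statistic estimate of the squared MMD between two samples:

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth):
    """k(a, b) = exp(-||a - b||^2 / (2 bandwidth^2)) evaluated for all pairs of rows of a and b."""
    d2 = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-d2 / (2.0 * bandwidth**2))

def mmd_squared(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of MMD^2 = E k(x,x') + E k(y,y') - 2 E k(x,y)."""
    kxx = gaussian_kernel(x, x, bandwidth).mean()
    kyy = gaussian_kernel(y, y, bandwidth).mean()
    kxy = gaussian_kernel(x, y, bandwidth).mean()
    return kxx + kyy - 2.0 * kxy

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(200, 2))
y = rng.normal(0.5, 1.0, size=(200, 2))
print(mmd_squared(x, y))  # larger values indicate stronger evidence that the two samples differ
```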
Motivated by a recent result by van Erven and Harremoës, we study a forward projection problem for the Rényi divergence on a particular α-convex set, termed an α-linear family. The solution to this problem yields a parametric family of probability measures which turns out to be an extension of the exponential family, and it is termed an α-exponential family. An orthogonality relationship between the α-exponential and α-linear families is first established and is then used to transform the reverse projection on an α-exponential family into a forward projection on an α-linear family. The full paper version of this work is available on the arXiv at http://arxiv.org/abs/1512.02515.
Dempster–Shafer evidence theory, which is an extension of Bayesian probability theory, is a useful approach to realize multisensor data fusion. It uses mass functions to represent uncertainty, which can produce a satisfactory fusion result. However, when the evidence is highly conflicting, using the Dempster–Shafer fusion rule to combine the evidence will generate a result contrary to common sense. To solve this issue, we propose a new method for conflict management based on Renyi divergence (RD). Then, by combining RD with the mass function, we develop the Renyi-Belief divergence (RBD). To expand its utility, we modify it and define the modified Renyi-Belief divergence (MRBD). Our method MRBD integrates the characteristics of mass functions and can handle conflict by measuring the differences between mass functions. Experiments show that MRBD can effectively deal with conflicts. After dealing with the conflicting evidence, we realize multisensor data fusion based on the Dempster–Shafer combination rule. Moreover, we also consider the information quality and belief entropy to reinforce the credibility of evidence. A large number of examples show that the proposed method is feasible and efficient. Finally, in the application of fault diagnosis, our method can effectively determine the fault type.
This paper studies forward and reverse projections for the Rényi divergence of order α ∈ (0, ∞) on α-convex sets. The forward projection on such a set is motivated by some works of Tsallis et al. in statistical physics, and the reverse projection is motivated by robust statistics. In a recent work, van Erven and Harremoës proved a Pythagorean inequality for Rényi divergences on α-convex sets under the assumption that the forward projection exists. Continuing this study, a sufficient condition for the existence of a forward projection is proved for probability measures on a general alphabet. For α ∈ (1, ∞), the proof relies on a new Apollonius theorem for the Hellinger divergence, and for α ∈ (0, 1), the proof relies on the Banach–Alaoglu theorem from functional analysis. Further projection results are then obtained in the finite alphabet setting. These include a projection theorem on a specific α-convex set, which is termed an α-linear family, generalizing a result by Csiszár to α ≠ 1. The solution to this problem yields a parametric family of probability measures, which turns out to be an extension of the exponential family, and it is termed an α-exponential family. An orthogonality relationship between the α-exponential and α-linear families is established, and it is used to turn the reverse projection on an α-exponential family into a forward projection on an α-linear family. This paper also proves a convergence result of an iterative procedure used to calculate the forward projection on an intersection of a finite number of α-linear families.
Relative entropies play important roles in classical and quantum information theory. In this paper, we discuss the sandwiched Rényi relative entropy for [Formula: see text] on [Formula: see text] (the cone of positive trace-class operators acting on an infinite-dimensional complex Hilbert space [Formula: see text]) and characterize all surjective maps preserving the sandwiched Rényi relative entropy on [Formula: see text]. Such transformations have the form [Formula: see text] for each [Formula: see text], where [Formula: see text] and [Formula: see text] is either a unitary or an anti-unitary operator on [Formula: see text]. Particularly, all surjective maps preserving the sandwiched Rényi relative entropy on [Formula: see text] (the set of all quantum states on [Formula: see text]) are necessarily implemented by either a unitary or an anti-unitary operator.
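For reference, in finite dimensions and in notation not taken from the paper, the sandwiched Rényi relative entropy of order $\alpha$ between a state $\rho$ and a positive operator $\sigma$ is commonly defined as
$$ \widetilde{D}_\alpha(\rho\,\|\,\sigma) = \frac{1}{\alpha-1}\,\log \operatorname{Tr}\!\Big[\Big(\sigma^{\frac{1-\alpha}{2\alpha}}\,\rho\,\sigma^{\frac{1-\alpha}{2\alpha}}\Big)^{\alpha}\Big], $$
which recovers the Umegaki relative entropy $\operatorname{Tr}[\rho(\log\rho-\log\sigma)]$ in the limit $\alpha \to 1$.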
We study the asymptotic consistency properties of $\alpha$-Renyi approximate posteriors, a class of variational Bayesian methods that approximate an intractable Bayesian posterior with a member of a tractable family of distributions, the member chosen to minimize the $\alpha$-Renyi divergence from the true posterior. Unique to our work is that we consider settings with $\alpha > 1$, resulting in approximations that upper bound the log-likelihood, and consequently have wider spread than traditional variational approaches that minimize the Kullback-Leibler (KL) divergence from the posterior. Our primary result identifies sufficient conditions under which consistency holds, centering around the existence of a 'good' sequence of distributions in the approximating family that possesses, among other properties, the right rate of convergence to a limit distribution. We further characterize the good sequence by demonstrating that a sequence of distributions that converges too quickly cannot be a good sequence. We also extend our analysis to the setting where $\alpha$ equals one, corresponding to the minimizer of the reverse KL divergence, and to models with local latent variables. We also illustrate the existence of good sequences with a number of examples. Our results complement a growing body of work focused on the frequentist properties of variational Bayesian methods.
Data with a multimodal pattern can be analyzed using a mixture model. In a mixture model, the most important step is the determination of the number of mixture components, because finding the correct number of mixture components will reduce the error of the resulting model. In a Bayesian analysis, one method that can be used to determine the number of mixture components is the reversible jump Markov chain Monte Carlo (RJMCMC). The RJMCMC is used for distributions that have location and scale parameters or location-scale distribution, such as the Gaussian distribution family. In this research, we added an important step before beginning to use the RJMCMC method, namely the modification of the analyzed distribution into location-scale distribution. We called this the non-Gaussian RJMCMC (NG-RJMCMC) algorithm. The following steps are the same as for the RJMCMC. In this study, we applied it to the Weibull distribution. This will help many researchers in the field of survival analysis since most of the survival time distribution is Weibull. We transformed the Weibull distribution into a location-scale distribution, which is the extreme value (EV) type 1 (Gumbel-type for minima) distribution. Thus, for the mixture analysis, we call this EV-I mixture distribution. Based on the simulation results, we can conclude that the accuracy level is at minimum 95%. We also applied the EV-I mixture distribution and compared it with the Gaussian mixture distribution for enzyme, acidity, and galaxy datasets. Based on the Kullback–Leibler divergence (KLD) and visual observation, the EV-I mixture distribution has higher coverage than the Gaussian mixture distribution. We also applied it to our dengue hemorrhagic fever (DHF) data from eastern Surabaya, East Java, Indonesia. The estimation results show that the number of mixture components in the data is four; we also obtained the estimation results of the other parameters and labels for each observation. Based on the Kullback–Leibler divergence (KLD) and visual observation, for our data, the EV-I mixture distribution offers better coverage than the Gaussian mixture distribution.
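A minimal sketch (the shape and scale values are made up) of the log transform underlying this step: if $T$ is Weibull with shape $k$ and scale $\lambda$, then $\log T$ has an extreme value type 1 (Gumbel-for-minima) distribution with location $\log\lambda$ and scale $1/k$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
k, lam = 1.7, 2.5  # Weibull shape and scale (illustrative values)
t = stats.weibull_min.rvs(c=k, scale=lam, size=20000, random_state=rng)

# The log-transformed data should follow a Gumbel-for-minima (EV type 1) law.
y = np.log(t)
ks = stats.kstest(y, stats.gumbel_l(loc=np.log(lam), scale=1.0 / k).cdf)
print(ks.pvalue)  # a large p-value is consistent with the claimed location-scale form
```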
This book is an introduction to the field of asymptotic statistics. The treatment is both practical and mathematically rigorous. In addition to most of the standard topics of an asymptotics course, including likelihood inference, M-estimation, the theory of asymptotic efficiency, U-statistics, and rank procedures, the book also presents recent research topics such as semiparametric models, the bootstrap, and empirical processes and their applications. The topics are organized from the central idea of approximation by limit experiments, which gives the book one of its unifying themes. This entails mainly the local approximation of the classical i.i.d. set up with smooth parameters by location experiments involving a single, normally distributed observation. Thus, even the standard subjects of asymptotic statistics are presented in a novel way. Suitable as a graduate or Master's level statistics text, this book will also give researchers an overview of research in asymptotic statistics.
1.1. Introduction.- 1.2. Outer Integrals and Measurable Majorants.- 1.3. Weak Convergence.- 1.4. Product Spaces.- 1.5. Spaces of Bounded Functions.- 1.6. Spaces of Locally Bounded Functions.- 1.7. The Ball Sigma-Field and Measurability of Suprema.- 1.8. Hilbert Spaces.- 1.9. Convergence: Almost Surely and in Probability.- 1.10. Convergence: Weak, Almost Uniform, and in Probability.- 1.11. Refinements.- 1.12. Uniformity and Metrization.- 2.1. Introduction.- 2.2. Maximal Inequalities and Covering Numbers.- 2.3. Symmetrization and Measurability.- 2.4. Glivenko-Cantelli Theorems.- 2.5. Donsker Theorems.- 2.6. Uniform Entropy Numbers.- 2.7. Bracketing Numbers.- 2.8. Uniformity in the Underlying Distribution.- 2.9. Multiplier Central Limit Theorems.- 2.10. Permanence of the Donsker Property.- 2.11. The Central Limit Theorem for Processes.- 2.12. Partial-Sum Processes.- 2.13. Other Donsker Classes.- 2.14. Tail Bounds.- 3.1. Introduction.- 3.2. M-Estimators.- 3.3. Z-Estimators.- 3.4. Rates of Convergence.- 3.5. Random Sample Size, Poissonization and Kac Processes.- 3.6. The Bootstrap.- 3.7. The Two-Sample Problem.- 3.8. Independence Empirical Processes.- 3.9. The Delta-Method.- 3.10. Contiguity.- 3.11. Convolution and Minimax Theorems.- A. Appendix.- A.1. Inequalities.- A.2. Gaussian Processes.- A.2.1. Inequalities and Gaussian Comparison.- A.2.2. Exponential Bounds.- A.2.3. Majorizing Measures.- A.2.4. Further Results.- A.3. Rademacher Processes.- A.4. Isoperimetric Inequalities for Product Measures.- A.5. Some Limit Theorems.- A.6. More Inequalities.- A.6.1. Binomial Random Variables.- A.6.2. Multinomial Random Vectors.- A.6.3. Rademacher Sums.- Notes.- References.- Author Index.- List of Symbols.
Let $p_1$ and $p_2$ be two probability measures on the same space and let $\varphi$ be the generalized Radon-Nikodym derivative of $p_2$ with respect to $p_1$. If $C$ is a continuous convex function of a real variable such that the $p_1$-expectation (generalized as in Section 3) of $C(\varphi)$ provides a reasonable coefficient of the $p_1$-dispersion of $\varphi$, then this expectation has basic properties which it is natural to demand of a coefficient of divergence of $p_2$ from $p_1$. A general class of coefficients of divergence is generated in this way and it is shown that various available measures of divergence, distance, discriminatory information, etc., are members of this class.
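A minimal sketch of this construction for strictly positive discrete distributions (the convex generators below are standard examples, not the paper's notation): the divergence coefficient is the $p_1$-expectation of a convex function of the likelihood ratio.

```python
import numpy as np

def divergence_coefficient(p1, p2, C):
    """E_{p1}[C(phi)] with phi = p2 / p1, for strictly positive discrete distributions."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    return float(np.sum(p1 * C(p2 / p1)))

p1 = np.array([0.5, 0.3, 0.2])
p2 = np.array([0.4, 0.4, 0.2])

print(divergence_coefficient(p1, p2, lambda t: t * np.log(t)))          # Kullback-Leibler divergence D(p2 || p1)
print(divergence_coefficient(p1, p2, lambda t: 0.5 * np.abs(t - 1)))    # total variation distance
print(divergence_coefficient(p1, p2, lambda t: (np.sqrt(t) - 1) ** 2))  # squared Hellinger distance (up to a factor of 2)
```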
The lower bound of Cramér and Rao is generalized to pairs of families of probability distributions, one of which is escort to the other. This bound is optimal for certain families, called φ-exponential in the paper. Their dual structure is explored. They satisfy a variational principle with respect to an appropriately chosen entropy functional, which is the dual of a free energy functional.
The Rényi information measures are characterized in terms of their Shannon counterparts, and properties of the former are recovered from first principles via the associated properties of the latter. Motivated by this characterization, a two-sensor composite hypothesis testing problem is presented, and the optimal worst-case miss-detection exponent is obtained in terms of a Rényi divergence.
Introduction. von Neumann's minimax theorem [10] can be stated as follows: if M and N are finite dimensional simplices and f is a bilinear function on M×N, then f has a saddle point, i.e., $\max_{\mu \in M}\min_{\nu \in N} f(\mu,\nu) = \min_{\nu \in N}\max_{\mu \in M} f(\mu,\nu)$. There have been several generalizations of this theorem. J. Ville [9], A. Wald [11], and others [1] variously extended von Neumann's result to cases where M and N were allowed to be subsets of certain infinite dimensional linear spaces. The functions f they considered, however, were still linear. M. Shiffman [8] seems to have been the first to have considered concave-convex functions in a minimax theorem. H. Kneser [6], K. Fan [3], and C. Berge [2] (using induction and the method of separating two disjoint convex sets in Euclidean space by a hyperplane) got minimax theorems for concave-convex functions that are appropriately semi-continuous in one of the two variables. Although these theorems include the previous results as special cases, they can also be shown to be rather direct consequences of von Neumann's theorem. H. Nikaidô [7], on the other hand, using Brouwer's fixed point theorem, proved the existence of a saddle point for functions satisfying the weaker algebraic condition of being quasi-concave-convex, but the stronger topological condition of being continuous in each variable. Thus, there seem to be essentially two types of argument: one uses some form of separation of disjoint convex sets by a hyperplane and yields the theorem of Kneser-Fan (see 4.2), and the other uses a fixed point theorem and yields Nikaidô's result. In this paper, we unify the two streams of thought by proving a minimax theorem for a function that is quasi-concave-convex and appropriately semi-continuous in each variable. The method of proof differs radically from any used previously. The difficulty lies in the fact that we cannot use a fixed point theorem (due to lack of continuity) nor the separation of disjoint convex sets by a hyperplane (due to lack of convexity). The key tool used is a theorem due to Knaster, Kuratowski, and Mazurkiewicz based on Sperner's lemma. It may be of some interest to point out that, in all the minimax theorems, the crucial argument is carried out on spaces M and N that
Exact forms of some invariants for distributions admitting sufficient statistics. V. S. Huzurbazar, Biometrika, Volume 42, Issue 3-4, December 1955, Pages 533–537, https://doi.org/10.1093/biomet/42.3-4.533.
Consider independent identically distributed observations whose distribution depends on a parameter $\theta$. Measure the distance between two parameter points $\theta_1, \theta_2$ by the Hellinger distance $h(\theta_1, \theta_2)$. Suppose that for $n$ observations there is a good but not perfect test of $\theta_0$ against $\theta_n$. Then $n^{\frac{1}{2}}h(\theta_0, \theta_n)$ stays away from zero and infinity. The usual parametric examples, regular or irregular, also have the property that there are estimates $\hat{\theta}_n$ such that $n^{\frac{1}{2}}h(\hat{\theta}_n, \theta_0)$ stays bounded in probability, so that rates of separation for tests and estimates are essentially the same. The present paper shows that this need not be true in general but is correct under certain metric dimensionality assumptions on the parameter set. It is then shown that these assumptions imply convergence at the required rate of the Bayes estimates or maximum probability estimates.
Rigorous probabilistic arguments, built on the foundation of measure theory introduced eighty years ago by Kolmogorov, have invaded many fields. Students of statistics, biostatistics, econometrics, finance, and other changing disciplines now find themselves needing to absorb theory beyond what they might have learned in the typical undergraduate, calculus-based probability course. This 2002 book grew from a one-semester course offered for many years to a mixed audience of graduate and undergraduate students who have not had the luxury of taking a course in measure theory. The core of the book covers the basic topics of independence, conditioning, martingales, convergence in distribution, and Fourier transforms. In addition there are numerous sections treating topics traditionally thought of as more advanced, such as coupling and the KMT strong approximation, option pricing via the equivalent martingale measure, and the isoperimetric inequality for Gaussian processes. The book is not just a presentation of mathematical theory, but is also a discussion of why that theory takes its current form. It will be a secure starting point for anyone who needs to invoke rigorous probabilistic arguments and understand what they mean.
The convergence of stochastic processes is defined in terms of the so-called "weak convergence" (w. c.) of probability measures in appropriate functional spaces (c. s. m. s.). Chapter 1. Let $\Re$ be the c. s. m. s. and $v$ a set of all finite measures on $\Re$. The distance $L(\mu_1, \mu_2)$ (that is analogous to the Lévy distance) is introduced, and equivalence of L-convergence and w. c. is proved. It is shown that $V\Re = (v, L)$ is a c. s. m. s. Then, the necessary and sufficient conditions for compactness in $V\Re$ are given. In section 1.6 the concept of "characteristic functionals" is applied to the study of w. c. of measures in Hilbert space. Chapter 2. On the basis of the above results the necessary and sufficient compactness conditions for families of probability measures in the spaces $C[0,1]$ and $D[0,1]$ (the space of functions that are continuous in $[0,1]$ except for jumps) are formulated. Chapter 3. The general form of the "invariance principle" for the sums of independent random variables is developed. Chapter 4. An estimate of the remainder term in the well-known Kolmogorov theorem is given (cf. [3.1]).
When studying convergence of measures, an important issue is the choice of probability metric. We provide a summary and some new results concerning bounds among some important probability metrics/distances that are used by statisticians and probabilists. Knowledge of other metrics can provide a means of deriving bounds for another one in an applied problem. Considering other metrics can also provide alternate insights. We also give examples that show that rates of convergence can strongly depend on the metric chosen. Careful consideration is necessary when choosing a metric.
Consider a class $\mathscr{P}=\{P_\theta:\theta\in\Theta\}$ of probability measures on a measurable space $(\mathscr{X},\mathscr{A})$, dominated by a $\sigma$-finite measure $\mu$. Let $f_\theta=dP_\theta/d\mu$, $\theta\in\Theta$, and let $\hat{\theta}_n$ be a maximum likelihood estimator based on $n$ independent observations from $P_{\theta_0}$, $\theta_0\in\Theta$. We use results from empirical process theory to obtain convergence for the Hellinger distance $h(f_{\hat{\theta}_n}, f_{\theta_0})$, under certain entropy conditions on the class of densities $\{f_\theta:\theta\in\Theta\}$. The examples we present are a model with interval censored observations, smooth densities, monotone densities and convolution models. In most examples, the convexity of the class of densities is of special importance.
Rényi statistics are considered in a directed family of general exponential models. These statistics are defined as Rényi distances between estimated and hypothetical model. An asymptotically quadratic approximation to the Rényi statistics is established, leading to similar asymptotic distribution results as established in the literature for the likelihood ratio statistics. Some arguments in favour of the Rényi statistics are discussed, and a numerical comparison of the Rényi goodness-of-fit tests with the likelihood ratio test is presented. Keywords: exponential models; testing goodness-of-fit; Rényi distances; Rényi statistics.
We consider an extension of ɛ-entropy to a KL-divergence based complexity measure for randomized density estimation methods. Based on this extension, we develop a general information-theoretical inequality that measures the statistical complexity of some deterministic and randomized density estimators. Consequences of the new inequality will be presented. In particular, we show that this technique can lead to improvements of some classical results concerning the convergence of minimum description length and Bayesian posterior distributions. Moreover, we are able to derive clean finite-sample convergence bounds that are not obtainable using previous approaches.
The minimum description length (MDL) principle is a powerful method of inductive inference, the basis of statistical modeling, pattern recognition, and machine learning. It holds that the best explanation, given a limited set of observed data, is the one that permits the greatest compression of the data. MDL methods are particularly well-suited for dealing with model selection, prediction, and estimation problems in situations where the models under consideration can be arbitrarily complex, and overfitting the data is a serious concern. This extensive, step-by-step introduction to the MDL Principle provides a comprehensive reference (with an emphasis on conceptual issues) that is accessible to graduate students and researchers in statistics, pattern classification, machine learning, and data mining, to philosophers interested in the foundations of statistics, and to researchers in other applied sciences that involve model selection, including biology, econometrics, and experimental psychology. Part I provides a basic introduction to MDL and an overview of the concepts in statistics and information theory needed to understand MDL. Part II treats universal coding, the information-theoretic notion on which MDL is built, and Part III gives a formal treatment of MDL theory as a theory of inductive inference based on universal coding. Part IV provides a comprehensive overview of the statistical theory of exponential families with an emphasis on their information-theoretic properties. The text includes a number of summaries, paragraphs offering the reader a fast track through the material, and boxes highlighting the most important concepts.
We study conditions on $f$ under which an $f$-divergence $D_f$ will satisfy $D_f \geq c_f V^2$ or $D_f \geq c_{2,f} V^2 + c_{4,f} V^4$, where $V$ denotes variational distance and the coefficients $c_f$, $c_{2,f}$ and $c_{4,f}$ are {\em best possible}. As a consequence, we obtain lower bounds in terms of $V$ for many well known distance and divergence measures. For instance, let $D_{(\alpha)} (P,Q) = [\alpha (\alpha-1)]^{-1} [\int q^{\alpha} p^{1-\alpha} d \mu -1]$ and ${\cal I}_\alpha (P,Q) = (\alpha -1)^{-1} \log [\int p^\alpha q^{1-\alpha} d \mu]$ be respectively the {\em relative information of type} ($1-\alpha$) and {\em R\'{e}nyi's information gain of order} $\alpha$. We show that $D_{(\alpha)} \geq {1/2} V^2 + {1/72} (\alpha+1)(2-\alpha) V^4$ whenever $-1 \leq \alpha \leq 2$, $\alpha \not= 0,1$ and that ${\cal I}_{\alpha} \geq \frac{\alpha}{2} V^2 + {1/36} \alpha (1 + 5 \alpha - 5 \alpha^2) V^4$ for $0 < \alpha < 1$. Pinsker's inequality $D \geq {1/2} V^2$ and its extension $D \geq {1/2} V^2 + {1/36} V^4$ are special cases of each one of these.
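A quick numerical sanity check of the two Pinsker-type bounds quoted above (a sketch assuming the conventions that appear consistent with the statement: $V=\sum_x|p(x)-q(x)|\in[0,2]$ and $D$ in nats):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_distribution(k):
    w = rng.random(k)
    return w / w.sum()

for _ in range(10000):
    p, q = random_distribution(5), random_distribution(5)
    D = float(np.sum(p * np.log(p / q)))   # Kullback-Leibler divergence in nats
    V = float(np.sum(np.abs(p - q)))       # total variation, in [0, 2]
    assert D >= 0.5 * V**2 - 1e-12                 # Pinsker's inequality
    assert D >= 0.5 * V**2 + V**4 / 36.0 - 1e-12   # its fourth-order refinement
print("both lower bounds held on all sampled pairs")
```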
The paper deals with the f-divergences of Csiszár generalizing the discrimination information of Kullback, the total variation distance, the Hellinger divergence, and the Pearson divergence. All basic properties of f-divergences including relations to the decision errors are proved in a new manner replacing the classical Jensen inequality by a new generalized Taylor expansion of convex functions. Some new properties are proved too, e.g., relations to the statistical sufficiency and deficiency. The generalized Taylor expansion also shows very easily that all f-divergences are average statistical informations (differences between prior and posterior Bayes errors) mutually differing only in the weights imposed on various prior distributions. The statistical information introduced by De Groot and the classical information of Shannon are shown to be extremal cases corresponding to α=0 and α=1 in the class of the so-called Arimoto α-informations introduced in this paper for 0<α<1 by means of the Arimoto α-entropies. Some new examples of f-divergences are introduced as well, namely, the Shannon divergences and the Arimoto α-divergences leading for α↑1 to the Shannon divergences. Square roots of all these divergences are shown to be metrics satisfying the triangle inequality. The last section introduces statistical tests and estimators based on the minimal f-divergence with the empirical distribution achieved in the families of hypothetic distributions. For the Kullback divergence this leads to the classical likelihood ratio test and estimator.
Let V and D denote, respectively, total variation and divergence. We study lower bounds of D with V fixed. The theoretically best (i.e., largest) lower bound determines a function L=L(V), Vajda's (1970) tight lower bound. The main result is an exact parametrization of L. This leads to Taylor polynomials which are lower bounds for L, and thereby to extensions of the classical Pinsker (1960) inequality which has numerous applications, cf. Pinsker and followers.
Some geometric properties of PD's are established, Kullback's $I$-divergence playing the role of squared Euclidean distance. The minimum discrimination information problem is viewed as that of projecting a PD onto a convex set of PD's, and useful existence theorems for and characterizations of the minimizing PD are arrived at. A natural generalization of known iterative algorithms converging to the minimizing PD in special situations is given; even for those special cases, our convergence proof is more generally valid than those previously published. As corollaries of independent interest, generalizations of known results on the existence of PD's or nonnegative matrices of a certain form are obtained. The Lagrange multiplier technique is not used.
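As a small illustration of such a projection, here is a sketch for a single linear constraint, using the standard fact that the I-projection of $Q$ onto $\{P: \mathbb{E}_P[T]=t\}$ is an exponential tilting $P^*(x)\propto Q(x)e^{\lambda T(x)}$; the reference distribution, statistic, and target moment below are made up.

```python
import numpy as np
from scipy.optimize import brentq

def i_projection(q, T, t):
    """I-projection of q onto {P : E_P[T] = t}; returns P*(x) proportional to q(x) exp(lambda T(x))."""
    def tilted(lam):
        w = q * np.exp(lam * T)
        return w / w.sum()
    lam = brentq(lambda l: tilted(l) @ T - t, -50.0, 50.0)  # solve the moment equation for lambda
    return tilted(lam)

q = np.array([0.25, 0.25, 0.25, 0.25])  # reference distribution on {0, 1, 2, 3}
T = np.arange(4.0)                      # constraint statistic (here: the mean)
p_star = i_projection(q, T, t=2.1)
print(p_star, p_star @ T)               # projected distribution and its (now matching) mean
```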
We study a measure of discrimination that arises in the context of redundancy in guessing the realization of a random variable. This discrimination measure satisfies a Pythagorean-type inequality. The analog of this inequality for Kullback-Leibler divergence is well-known.
Rényi divergence is related to Rényi entropy much like information divergence (also called Kullback-Leibler divergence or relative entropy) is related to Shannon's entropy, and comes up in many settings. It was introduced by Rényi as a measure of information that satisfies almost the same axioms as information divergence. We review the most important properties of Rényi divergence, including its relation to some other distances. We show how Rényi divergence appears when the theory of majorization is generalized from the finite to the continuous setting. Finally, Rényi divergence plays a role in analyzing the number of binary questions required to guess the values of a sequence of random variables.
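For completeness, the discrete-case definition of the quantity that runs through these references, for order $\alpha\in(0,1)\cup(1,\infty)$, together with its Kullback-Leibler limit:
$$ D_\alpha(P\|Q) = \frac{1}{\alpha-1}\log\sum_x p(x)^\alpha q(x)^{1-\alpha}, \qquad \lim_{\alpha\to 1} D_\alpha(P\|Q) = D(P\|Q) = \sum_x p(x)\log\frac{p(x)}{q(x)}. $$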