All published works (30)

We study the problem of maximizing a function that is approximately submodular under a cardinality constraint. Approximate submodularity implicitly appears in a wide range of applications as in many cases errors in evaluation of a submodular function break submodularity. Say that $F$ is $\varepsilon$-approximately submodular if there exists a submodular function $f$ such that $(1-\varepsilon)f(S) \leq F(S)\leq (1+\varepsilon)f(S)$ for all subsets $S$. We are interested in characterizing the query-complexity of maximizing $F$ subject to a cardinality constraint $k$ as a function of the error level $\varepsilon>0$. We provide both lower and upper bounds: for $\varepsilon>n^{-1/2}$ we show an exponential query-complexity lower bound. In contrast, when $\varepsilon < 1/k$ or under a stronger bounded curvature assumption, we give constant-factor approximation algorithms.
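As a minimal sketch of the query model behind this abstract (not the paper's algorithms or bounds), the snippet below runs the standard greedy heuristic against a coverage oracle evaluated with multiplicative noise, i.e. an assumed $\varepsilon$-approximately submodular oracle; all names and parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy ground set: 20 candidate sets covering a universe of 50 items.
universe_sets = {i: set(rng.choice(50, size=8, replace=False)) for i in range(20)}
eps = 0.05  # assumed noise level, for illustration only


def F(S):
    """Coverage function evaluated with multiplicative noise, i.e. an
    epsilon-approximately submodular oracle (illustrative assumption)."""
    covered = set().union(*(universe_sets[i] for i in S)) if S else set()
    return len(covered) * (1 + rng.uniform(-eps, eps))


def greedy_max(F, ground_set, k):
    """Plain greedy under a cardinality constraint; one oracle query per
    candidate and per round. With an approximately submodular oracle its
    guarantee degrades with the error level (the regimes are in the abstract)."""
    S = set()
    for _ in range(k):
        gains = {e: F(S | {e}) - F(S) for e in ground_set - S}
        best = max(gains, key=gains.get)
        if gains[best] <= 0:
            break
        S.add(best)
    return S


print(greedy_max(F, set(universe_sets), k=5))
```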
We consider a Stackelberg game in which a principal (she) establishes a two-stage contract with a non-myopic agent (he) whose type is unknown. The contract takes the form of an incentive function mapping the agent's first-stage action to his second-stage incentive. While the first-stage action reveals the agent's type under truthful play, a non-myopic agent could benefit from portraying a false type in the first stage to obtain a larger incentive in the second stage. The challenge is thus for the principal to design the incentive function so as to induce truthful play. We show that this is only possible with constant, non-reactive incentive functions when the type space is continuous, whereas it can be achieved with reactive functions for discrete types. Additionally, we show that introducing an adjustment mechanism that penalizes inconsistent behavior across both stages allows the principal to design more flexible incentive functions.
Modern data sets, such as those in healthcare and e-commerce, are often derived from many individuals or systems but have insufficient data from each source alone to separately estimate individual, often high-dimensional, model parameters. If there is shared structure among systems, however, it may be possible to leverage data from other systems to help estimate individual parameters, which could otherwise be non-identifiable. In this paper, we assume systems share a latent low-dimensional parameter space and propose a method for recovering $d$-dimensional parameters for $N$ different linear systems when there are only $T < d$ observations per system. To do so, we develop a three-step method which estimates the low-dimensional subspace spanned by the systems' parameters and produces refined parameter estimates within the subspace. We provide finite sample subspace estimation error guarantees for our proposed method. Finally, we experimentally validate our method on simulations with i.i.d. regression data as well as correlated time series data.
We consider games of incomplete information in which the players' payoffs depend both on a privately observed type and an unknown but common "state of nature". External to the game, a data provider knows the state of nature and sells information to the players, thus solving a joint information and mechanism design problem: deciding which information to sell while eliciting the players' types and collecting payments. We restrict ourselves to a general class of symmetric games with quadratic payoffs that includes games of both strategic substitutes (e.g. Cournot competition) and strategic complements (e.g. Bertrand competition, Keynesian beauty contest). By the Revelation Principle, the seller's problem reduces to designing a mechanism that truthfully elicits the players' types and sends action recommendations that constitute a Bayes Correlated Equilibrium of the game. We fully characterize the class of all such Gaussian mechanisms (where the joint distribution of actions and private signals is a multivariate normal distribution) as well as the welfare- and revenue-optimal mechanisms within this class. For games of strategic complements, the optimal mechanisms maximally correlate the players' actions, and conversely maximally anticorrelate them for games of strategic substitutes. In both cases, for sufficiently large uncertainty over the players' types, the recommendations are deterministic (and linear) conditional on the state and the type reports, but they are not fully revealing.
The design of data markets has gained importance as firms increasingly use machine learning models fueled by externally acquired training data. A key consideration is the externalities firms face when data, though inherently freely replicable, is allocated to competing firms. In this setting, we demonstrate that a data seller's optimal revenue increases when firms can pay to prevent allocations to others. To do so, we first reduce the combinatorial problem of allocating and pricing multiple datasets to the auction of a single digital good by modeling utility for data through the increase in prediction accuracy it provides. We then derive welfare and revenue maximizing mechanisms, highlighting how the form of firms' private information - whether the externalities one exerts on others are known, or vice versa - affects the resulting structures. In all cases, the optimal allocation rule is a single threshold per firm, where either all data is allocated or none is.
Modern data sets, such as those in healthcare and e-commerce, are often derived from many individuals or systems but have insufficient data from each source alone to separately estimate individual, often high-dimensional, model parameters. If there is shared structure among systems, however, it may be possible to leverage data from other systems to help estimate individual parameters, which could otherwise be non-identifiable. In this paper, we assume systems share a latent low-dimensional parameter space and propose a method for recovering $d$-dimensional parameters for $N$ different linear systems, even when there are only $T<d$ observations per system. To do so, we develop a three-step algorithm which estimates the low-dimensional subspace spanned by the systems' parameters and produces refined parameter estimates within the subspace. We provide finite sample subspace estimation error guarantees for our proposed method. Finally, we experimentally validate our method on simulations with i.i.d. regression data as well as correlated time series data.
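A rough sketch of the shared-subspace idea for illustration only (this is a generic moment-based construction under assumed i.i.d. Gaussian designs, not necessarily the paper's three steps): pool a crude estimate from each system to estimate the shared subspace, then re-fit each system's parameter inside that subspace.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T, d, r = 200, 10, 50, 3          # T < d: per-system OLS is underdetermined

# Ground truth: parameters theta_i = U w_i live in an r-dimensional subspace.
U_true = np.linalg.qr(rng.normal(size=(d, r)))[0]
W_true = rng.normal(size=(r, N))
data = []
for i in range(N):
    X = rng.normal(size=(T, d))
    y = X @ (U_true @ W_true[:, i]) + 0.1 * rng.normal(size=T)
    data.append((X, y))

# Step 1: crude per-system moment estimates X_i^T y_i / T (proportional to theta_i in expectation).
B = np.column_stack([X.T @ y / T for X, y in data])          # d x N

# Step 2: estimate the shared subspace from the top-r left singular vectors.
U_hat = np.linalg.svd(B, full_matrices=False)[0][:, :r]

# Step 3: re-estimate each system restricted to the subspace (only r unknowns per system).
Theta_hat = np.column_stack([
    U_hat @ np.linalg.lstsq(X @ U_hat, y, rcond=None)[0] for X, y in data
])

# Subspace estimation error (projection distance): small despite T < d.
err = np.linalg.norm(U_true @ U_true.T - U_hat @ U_hat.T, ord=2)
print(f"subspace error: {err:.3f}")
```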
In contrast to their seemingly simple and shared structure of independence and stationarity, Lévy processes exhibit a wide variety of behaviors, from the self-similar Wiener process to piecewise-constant compound Poisson processes. Inspired by the recent paper of Ghourchian, Amini, and Gohari (2018), we characterize their compressibility by studying the entropy of their double discretization (both in time and amplitude) in the regime of vanishing discretization steps. For a Lévy process with absolutely continuous marginals, this reduces to understanding the asymptotics of the differential entropy of its marginals at small times, for which we obtain a new local central limit theorem. We generalize known results for stable processes to the non-stable case, with a special focus on Lévy processes that are locally self-similar, and conceptualize a new compressibility hierarchy of Lévy processes, captured by their Blumenthal-Getoor index.
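For reference, the Blumenthal-Getoor index mentioned above is standardly defined from the Lévy measure $\nu$ of the process; the precise role it plays in the compressibility hierarchy is established in the paper itself.

```latex
% Blumenthal-Getoor index of a Lévy process with Lévy measure \nu
\beta \;=\; \inf\Big\{\, p > 0 \;:\; \int_{|x|\le 1} |x|^{p}\,\nu(\mathrm{d}x) < \infty \,\Big\},
\qquad \beta \in [0,2].
% For an \alpha-stable process, \beta = \alpha; for a compound Poisson process, \beta = 0.
```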
Data buyers compete in a game of incomplete information about which a single data seller owns some payoff-relevant information. The seller faces a joint information- and mechanism-design problem: deciding which information to sell, while eliciting the buyers' types and imposing payments. We derive the welfare- and revenue-optimal mechanisms for a class of games with binary actions and states. Our results highlight the critical properties of selling information in competitive environments: (i) the negative externalities arising from buyer competition increase the profitability of recommending the correct action to one buyer exclusively; (ii) for the buyers to follow the seller's recommendations, the degree of exclusivity must be limited; (iii) the buyers' obedience constraints also limit the distortions in the allocation of information introduced by a monopolist seller; (iv) as competition becomes fiercer, these limitations become more severe, weakening the impact of market power on the allocation of information.
The families of $f$-divergences (e.g. the Kullback-Leibler divergence) and Integral Probability Metrics (e.g. total variation distance or maximum mean discrepancies) are widely used to quantify the similarity between probability distributions. In this work, we systematically study the relationship between these two families from the perspective of convex duality. Starting from a tight variational representation of the $f$-divergence, we derive a generalization of the moment-generating function, which we show exactly characterizes the best lower bound of the $f$-divergence as a function of a given IPM. Using this characterization, we obtain new bounds while also recovering in a unified manner well-known results, such as Hoeffding's lemma, Pinsker's inequality and its extension to subgaussian functions, and the Hammersley-Chapman-Robbins bound. The variational representation also allows us to prove new results on topological properties of the divergence which may be of independent interest.
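For orientation, a standard variational (convex-dual) representation of the $f$-divergence and one classical consequence relating it to an IPM are shown below; the paper's tight representation and generalized moment-generating function may differ in form.

```latex
% Variational representation (f convex with f(1)=0, f^* its convex conjugate, sup over test functions g):
D_f(P \,\|\, Q) \;=\; \sup_{g}\; \mathbb{E}_{P}\!\left[ g(X) \right] - \mathbb{E}_{Q}\!\left[ f^{*}\!\big(g(X)\big) \right].

% A classical consequence bounding total variation by the KL divergence (Pinsker's inequality):
\mathrm{TV}(P, Q) \;\le\; \sqrt{\tfrac{1}{2}\, D_{\mathrm{KL}}(P \,\|\, Q)}.
```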
The need for statistical estimation with large data sets has reinvigorated interest in iterative procedures and stochastic optimization. Stochastic approximations are at the forefront of this recent development as they yield procedures that are simple, general and fast. However, standard stochastic approximations are often numerically unstable. Deterministic optimization, in contrast, increasingly uses proximal updates to achieve numerical stability in a principled manner. A theoretical gap has thus emerged. While standard stochastic approximations are subsumed by the framework of Robbins and Monro (The Annals of Mathematical Statistics, 1951, pp. 400–407), there is no such framework for stochastic approximations with proximal updates. In this paper, we conceptualize a proximal version of the classical Robbins–Monro procedure. Our theoretical analysis demonstrates that the proposed procedure has important stability benefits over the classical Robbins–Monro procedure, while it retains the best known convergence rates. Exact implementations of the proximal Robbins–Monro procedure are challenging, but we show that approximate implementations lead to procedures that are easy to implement, and still dominate standard procedures by achieving numerical stability, practically without trade-offs. Moreover, approximate proximal Robbins–Monro procedures can be applied even when the objective cannot be calculated analytically, and so they generalize stochastic proximal procedures currently in use.
In contrast to their seemingly simple and shared structure of independence and stationarity, Lévy processes exhibit a wide variety of behaviors, from the self-similar Wiener process to piecewise-constant compound Poisson processes. Inspired by the recent paper of Ghourchian, Amini, and Gohari, we characterize their compressibility by studying the entropy of their double discretization (both in time and amplitude) in the regime of vanishing discretization steps. For a Lévy process with absolutely continuous marginals, this reduces to understanding the asymptotics of the differential entropy of its marginals at small times. We generalize known results for stable processes to the non-stable case, and conceptualize a new compressibility hierarchy of Lévy processes, captured by their Blumenthal-Getoor index.
This paper shows several connections between data structure problems and cryptography against preprocessing attacks. Our results span data structure upper bounds, cryptographic applications, and data structure lower bounds, as summarized next.
The families of $f$-divergences (e.g. the Kullback-Leibler divergence) and Integral Probability Metrics (e.g. total variation distance or maximum mean discrepancies) are widely used to quantify the similarity between probability distributions. In this work, we systematically study the relationship between these two families from the perspective of convex duality. Starting from a tight variational representation of the $f$-divergence, we derive a generalization of the moment-generating function, which we show exactly characterizes the best lower bound of the $f$-divergence as a function of a given IPM. Using this characterization, we obtain new bounds while also recovering in a unified manner well-known results, such as Hoeffding's lemma, Pinsker's inequality and its extension to subgaussian functions, and the Hammersley-Chapman-Robbins bound. The variational representation also allows us to prove new results on topological properties of the divergence which may be of independent interest.
The design of data markets has gained importance as firms increasingly use machine learning models fueled by externally acquired training data. A key consideration is the externalities firms face when data, though inherently freely replicable, is allocated to competing firms. In this setting, we demonstrate that a data seller's optimal revenue increases when firms can pay to prevent allocations to others. To do so, we first reduce the combinatorial problem of allocating and pricing multiple datasets to the auction of a single digital good by modeling utility for data through the increase in prediction accuracy it provides. We then derive welfare and revenue maximizing mechanisms, highlighting how the form of firms' private information - whether the externalities one exerts on others are known, or vice versa - affects the resulting structures. In all cases, under appropriate assumptions, the optimal allocation rule is a single threshold per firm, where either all data is allocated or none is.
The families of $f$-divergences (e.g. the Kullback-Leibler divergence) and Integral Probability Metrics (e.g. total variation distance or maximum mean discrepancies) are widely used to quantify the similarity between probability distributions. In this work, we systematically study the relationship between these two families from the perspective of convex duality. Starting from a tight variational representation of the $f$-divergence, we derive a generalization of the moment-generating function, which we show exactly characterizes the best lower bound of the $f$-divergence as a function of a given IPM. Using this characterization, we obtain new bounds while also recovering in a unified manner well-known results, such as Hoeffding's lemma, Pinsker's inequality and its extension to subgaussian functions, and the Hammersley-Chapman-Robbins bound. This characterization also allows us to prove new results on topological properties of the divergence which may be of independent interest.
This paper shows several connections between data structure problems and cryptography against preprocessing attacks. Our results span data structure upper bounds, cryptographic applications, and data structure lower bounds, as summarized next. First, we apply Fiat-Naor inversion, a technique with cryptographic origins, to obtain a data structure upper bound. In particular, our technique yields a suite of algorithms with space $S$ and (online) time $T$ for a preprocessing version of the $N$-input 3SUM problem where $S^3\cdot T = \widetilde{O}(N^6)$. This disproves a strong conjecture (Goldstein et al., WADS 2017) that there is no data structure that solves this problem for $S=N^{2-\delta}$ and $T = N^{1-\delta}$ for any constant $\delta>0$. Secondly, we show equivalence between lower bounds for a broad class of (static) data structure problems and one-way functions in the random oracle model that resist a very strong form of preprocessing attack. Concretely, given a random function $F: [N] \to [N]$ (accessed as an oracle) we show how to compile it into a function $G^F: [N^2] \to [N^2]$ which resists $S$-bit preprocessing attacks that run in query time $T$ where $ST=O(N^{2-\varepsilon})$ (assuming a corresponding data structure lower bound on 3SUM). In contrast, a classical result of Hellman tells us that $F$ itself can be more easily inverted, say with $N^{2/3}$-bit preprocessing in $N^{2/3}$ time. We also show that much stronger lower bounds follow from the hardness of kSUM. Our results can be equivalently interpreted as security against adversaries that are very non-uniform, or have large auxiliary input, or as security in the face of a powerfully backdoored random oracle. Thirdly, we give non-adaptive lower bounds for 3SUM and a range of geometric problems which match the best known lower bounds for static data structure problems.
We introduce hardness in relative entropy, a new notion of hardness for search problems which on the one hand is satisfied by all one-way functions and on the other hand implies both next-block pseudoentropy and inaccessible entropy, two forms of computational entropy used in recent constructions of pseudorandom generators and statistically hiding commitment schemes, respectively. Thus, hardness in relative entropy unifies the latter two notions of computational entropy and sheds light on the apparent "duality" between them. Additionally, it yields a more modular and illuminating proof that one-way functions imply next-block inaccessible entropy, similar in structure to the proof that one-way functions imply next-block pseudoentropy (Vadhan and Zheng, STOC '12).
We study secure and undetectable communication in a world where governments can read all encrypted communications of citizens. We consider a world where the only permitted communication method is via a government-mandated encryption scheme, using government-mandated keys. Citizens caught trying to communicate otherwise (e.g., by encrypting strings which do not appear to be natural language plaintexts) will be arrested. The one guarantee we suppose is that the government-mandated encryption scheme is semantically secure against outsiders: a perhaps advantageous feature to secure communication against foreign entities. But what good is semantic security against an adversary that has the power to decrypt? Even in this pessimistic scenario, we show citizens can communicate securely and undetectably. Informally, there is a protocol between Alice and Bob where they exchange ciphertexts that look innocuous even to someone who knows the secret keys and thus sees the corresponding plaintexts. And yet, in the end, Alice will have transmitted her secret message to Bob. Our security definition requires indistinguishability between unmodified use of the mandated encryption scheme, and conversations using the mandated encryption scheme in a modified way for subliminal communication. Our topics may be thought to fall broadly within the realm of steganography: the science of hiding secret communication in innocent-looking messages, or cover objects. However, we deal with the non-standard setting of adversarial cover object distributions (i.e., a stronger-than-usual adversary). We leverage that our cover objects are ciphertexts of a secure encryption scheme to bypass impossibility results which we show for broader classes of steganographic schemes. We give several constructions of subliminal communication schemes based on any key exchange protocol with random messages (e.g., Diffie-Hellman).
The need for parameter estimation with massive data has reinvigorated interest in iterative estimation procedures. Stochastic approximations, such as stochastic gradient descent, are at the forefront of this recent development because they yield simple, generic, and extremely fast iterative estimation procedures. Such stochastic approximations, however, are often numerically unstable. As a consequence, current practice has turned to proximal operators, which can induce stable parameter updates within iterations. While the majority of classical iterative estimation procedures are subsumed by the framework of Robbins and Monro (1951), there is no such generalization for stochastic approximations with proximal updates. In this paper, we conceptualize a general stochastic approximation method with proximal updates. This method can be applied even in situations where the analytical form of the objective is not known, and so it generalizes many stochastic gradient procedures with proximal operators currently in use. Our theoretical analysis indicates that the proposed method has important stability benefits over the classical stochastic approximation method. Exact instantiations of the proposed method are challenging, but we show that approximate instantiations lead to procedures that are easy to implement, and still dominate classical procedures by achieving numerical stability without tradeoffs. This last advantage is akin to that seen in deterministic proximal optimization, where the framework is typically impossible to instantiate exactly, but where approximate instantiations lead to new optimization procedures that dominate classical ones.
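To make the flavor of a proximal (implicit) stochastic update concrete, here is a minimal sketch for least-squares regression, where the implicit update happens to have a closed form; it illustrates the general idea of stabilizing the step, not the specific procedure proposed in the paper, and the learning-rate choice is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 10, 5000
theta_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ theta_true + 0.5 * rng.normal(size=n)

def explicit_sgd_step(theta, x, yi, lr):
    # Classical Robbins-Monro / SGD step: can be unstable if lr is too large.
    return theta + lr * (yi - x @ theta) * x

def implicit_sgd_step(theta, x, yi, lr):
    # Proximal ("implicit") step: theta_new = theta + lr * (yi - x @ theta_new) * x.
    # For least squares this fixed-point equation has a closed-form solution,
    # which shrinks the effective step size and keeps the update stable.
    return theta + (lr / (1.0 + lr * (x @ x))) * (yi - x @ theta) * x

theta_exp = np.zeros(d)
theta_imp = np.zeros(d)
for t in range(n):
    lr = 5.0 / (t + 1.0)                 # deliberately aggressive learning rate
    theta_exp = explicit_sgd_step(theta_exp, X[t], y[t], lr)
    theta_imp = implicit_sgd_step(theta_imp, X[t], y[t], lr)

print("explicit SGD error:", np.linalg.norm(theta_exp - theta_true))
print("implicit SGD error:", np.linalg.norm(theta_imp - theta_true))
```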
The need for parameter estimation with massive datasets has reinvigorated interest in stochastic optimization and iterative estimation procedures. Stochastic approximations are at the forefront of this recent development as they yield procedures that are simple, general, and fast. However, standard stochastic approximations are often numerically unstable. Deterministic optimization, on the other hand, increasingly uses proximal updates to achieve numerical stability in a principled manner. A theoretical gap has thus emerged. While standard stochastic approximations are subsumed by the framework of Robbins and Monro (1951), there is no such framework for stochastic approximations with proximal updates. In this paper, we conceptualize a proximal version of the classical Robbins-Monro procedure. Our theoretical analysis demonstrates that the proposed procedure has important stability benefits over the classical Robbins-Monro procedure, while it retains the best known convergence rates. Exact implementations of the proximal Robbins-Monro procedure are challenging, but we show that approximate implementations lead to procedures that are easy to implement, and still dominate classical procedures by achieving numerical stability, practically without tradeoffs. Moreover, approximate proximal Robbins-Monro procedures can be applied even when the objective cannot be calculated analytically, and so they generalize stochastic proximal procedures currently in use.
In the Network Inference problem, one seeks to recover the edges of an unknown graph from the observations of cascades propagating over this graph. In this paper, we approach this problem from the sparse recovery perspective. We introduce a general model of cascades, including the voter model and the independent cascade model, for which we provide the first algorithm which recovers the graph's edges with high probability and $O(s\log m)$ measurements where $s$ is the maximum degree of the graph and $m$ is the number of nodes. Furthermore, we show that our algorithm also recovers the edge weights (the parameters of the diffusion process) and is robust in the context of approximate sparsity. Finally we prove an almost matching lower bound of $\Omega(s\log\frac{m}{s})$ and validate our approach empirically on synthetic graphs.
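To illustrate the sparse-recovery viewpoint (and not the paper's specific estimator), one can try to recover the incoming edges of a single node from cascade snapshots with an $\ell_1$-penalized regression: the features record which nodes were active at the previous step, and the response records whether the target node activates next. The scikit-learn call below is a generic baseline under these assumed dynamics; node counts, activation probabilities and thresholds are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
m, s, n_obs = 50, 3, 2000          # nodes, in-degree of the target, observed transitions
parents = rng.choice(m, size=s, replace=False)
weights = rng.uniform(0.5, 1.0, size=s)

# Each row: which nodes were active at step t; label: did the target node activate at t+1?
A = (rng.random((n_obs, m)) < 0.2).astype(float)
logits = A[:, parents] @ weights - 1.0
labels = (rng.random(n_obs) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

# l1-penalized logistic regression selects a sparse set of candidate parents.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(A, labels)
recovered = np.flatnonzero(np.abs(clf.coef_[0]) > 0.1)
print("true parents:", sorted(parents.tolist()))
print("recovered   :", sorted(recovered.tolist()))
```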
In recent years, social networking platforms have developed into extraordinary channels for spreading and consuming information. Along with the rise of such infrastructure, there is continuous progress on techniques for spreading information effectively through influential users. In many applications, one is restricted to select influencers from a set of users who engaged with the topic being promoted, and due to the structure of social networks, these users often rank low in terms of their influence potential. An alternative approach one can consider is an adaptive method which selects users in a manner that targets their influential neighbors. The advantage of such an approach is that it leverages the friendship paradox in social networks: while users are often not influential, they often know someone who is. Despite the various complexities in such optimization problems, we show that scalable adaptive seeding is achievable. In particular, we develop algorithms for linear influence models with provable approximation guarantees that can be gracefully parallelized. To show the effectiveness of our methods we collected data from various verticals that social network users follow. For each vertical, we collected data on the users who responded to a certain post as well as their neighbors, and applied our methods on this data. Our experiments show that adaptive seeding is scalable, and importantly, that it obtains dramatic improvements over standard approaches to information dissemination.
In many applications of influence maximization, one is restricted to select influencers from a set of users who engaged with the topic being promoted, and due to the structure of social networks, these users often rank low in terms of their influence potential. To alleviate this issue, one can consider an adaptive method which selects users in a manner that targets their influential neighbors. The advantage of such an approach is that it leverages the friendship paradox in social networks: while users are often not influential, they often know someone who is. Despite the various complexities in such optimization problems, we show that scalable adaptive seeding is achievable. To show the effectiveness of our methods we collected data from various verticals that social network users follow, and applied our methods on it. Our experiments show that adaptive seeding is scalable, and that it obtains dramatic improvements over standard approaches to information dissemination.
In the Network Inference problem, one seeks to recover the edges of an unknown graph from the observations of cascades propagating over this graph. In this paper, we approach this problem from the sparse recovery perspective. We introduce a general model of cascades, including the voter model and the independent cascade model, for which we provide the first algorithm which recovers the graph's edges with high probability and $O(s\log m)$ measurements where $s$ is the maximum degree of the graph and $m$ is the number of nodes. Furthermore, we show that our algorithm also recovers the edge weights (the parameters of the diffusion process) and is robust in the context of approximate sparsity. Finally we prove an almost matching lower bound of $\Omega(s\log\frac{m}{s})$ and validate our approach empirically on synthetic graphs.
In recent years, social networking platforms have developed into extraordinary channels for spreading and consuming information. Along with the rise of such infrastructure, there is continuous progress on techniques for spreading information effectively through influential users. In many applications, one is restricted to select influencers from a set of users who engaged with the topic being promoted, and due to the structure of social networks, these users often rank low in terms of their influence potential. An alternative approach one can consider is an adaptive method which selects users in a manner that targets their influential neighbors. The advantage of such an approach is that it leverages the friendship paradox in social networks: while users are often not influential, they often know someone who is. Despite the various complexities in such optimization problems, we show that scalable adaptive seeding is achievable. In particular, we develop algorithms for linear influence models with provable approximation guarantees that can be gracefully parallelized. To show the effectiveness of our methods we collected data from various verticals that social network users follow. For each vertical, we collected data on the users who responded to a certain post as well as their neighbors, and applied our methods on this data. Our experiments show that adaptive seeding is scalable, and importantly, that it obtains dramatic improvements over standard approaches to information dissemination.
The need for parameter estimation with massive datasets has reinvigorated interest in stochastic optimization and iterative estimation procedures. Stochastic approximations are at the forefront of this recent development as they yield procedures that are simple, general, and fast. However, standard stochastic approximations are often numerically unstable. Deterministic optimization, on the other hand, increasingly uses proximal updates to achieve numerical stability in a principled manner. A theoretical gap has thus emerged. While standard stochastic approximations are subsumed by the framework of Robbins and Monro (1951), there is no such framework for stochastic approximations with proximal updates. In this paper, we conceptualize a proximal version of the classical Robbins-Monro procedure. Our theoretical analysis demonstrates that the proposed procedure has important stability benefits over the classical Robbins-Monro procedure, while it retains the best known convergence rates. Exact implementations of the proximal Robbins-Monro procedure are challenging, but we show that approximate implementations lead to procedures that are easy to implement, and still dominate classical procedures by achieving numerical stability, practically without tradeoffs. Moreover, approximate proximal Robbins-Monro procedures can be applied even when the objective cannot be calculated analytically, and so they generalize stochastic proximal procedures currently in use.

Commonly Cited References

We propose an inertial forward–backward splitting algorithm to compute a zero of a sum of two monotone operators allowing for stochastic errors in the computation of the operators. More precisely, we establish almost sure convergence in real Hilbert spaces of the sequence of iterates to an optimal solution. Then, based on this analysis, we introduce two new classes of stochastic inertial primal–dual splitting methods for solving structured systems of composite monotone inclusions and prove their convergence. Our results extend to the stochastic and inertial setting various types of structured monotone inclusion problems and corresponding algorithmic solutions. Application to minimization problems is discussed.
We propose a new method for estimation in linear models. The 'lasso' minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant. Because of the nature of this constraint it tends to produce some coefficients that are exactly 0 and hence gives interpretable models. Our simulation studies suggest that the lasso enjoys some of the favourable properties of both subset selection and ridge regression. It produces interpretable models like subset selection and exhibits the stability of ridge regression. There is also an interesting relationship with recent work in adaptive function estimation by Donoho and Johnstone. The lasso idea is quite general and can be applied in a variety of statistical models: extensions to generalized regression models and tree-based models are briefly described.
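As a quick numerical illustration of the exact-zero behavior described above (a generic scikit-learn sketch with illustrative dimensions, not tied to any particular paper in this list):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 20
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]                  # only 3 truly nonzero coefficients
X = rng.normal(size=(n, p))
y = X @ beta + 0.5 * rng.normal(size=n)

fit = Lasso(alpha=0.1).fit(X, y)
print("nonzero coefficients:", np.flatnonzero(fit.coef_))   # typically a small, interpretable set
```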
Methods based on l1-relaxation, such as basis pursuit and the Lasso, are very popular for sparse regression in high dimensions. The conditions for success of these methods are now well-understood: (1) exact recovery in the noiseless setting is possible if and only if the design matrix X satisfies the restricted nullspace property, and (2) the squared l2-error of a Lasso estimate decays at the minimax optimal rate k log p / n, where k is the sparsity of the p-dimensional regression problem with additive Gaussian noise, whenever the design satisfies a restricted eigenvalue condition. The key issue is thus to determine when the design matrix X satisfies these desirable properties. Thus far, there have been numerous results showing that the restricted isometry property, which implies both the restricted nullspace and eigenvalue conditions, is satisfied when all entries of X are independent and identically distributed (i.i.d.), or the rows are unitary. This paper proves directly that the restricted nullspace and eigenvalue conditions hold with high probability for quite general classes of Gaussian matrices for which the predictors may be highly dependent, and hence restricted isometry conditions can be violated with high probability. In this way, our results extend the attractive theoretical guarantees on l1-relaxations to a much broader class of problems than the case of completely independent or unitary designs.
Time plays an essential role in the diffusion of information, influence and disease over networks. In many cases we only observe when a node copies information, makes a decision or becomes infected -- but the connectivity, transmission rates between nodes and transmission sources are unknown. Inferring the underlying dynamics is of outstanding interest since it enables forecasting, influencing and retarding infections, broadly construed. To this end, we model diffusion processes as discrete networks of continuous temporal processes occurring at different rates. Given cascade data -- observed infection times of nodes -- we infer the edges of the global diffusion network and estimate the transmission rates of each edge that best explain the observed data. The optimization problem is convex. The model naturally (without heuristics) imposes sparse solutions and requires no parameter tuning. The problem decouples into a collection of independent smaller problems, thus scaling easily to networks on the order of hundreds of thousands of nodes. Experiments on real and synthetic data show that our algorithm both recovers the edges of diffusion networks and accurately estimates their transmission rates from cascade data.
We identify a necessary and sufficient condition for a Lévy white noise to be a tempered distribution. More precisely, we show that if the Lévy measure associated with this noise has a positive absolute moment, then the Lévy white noise almost surely takes values in the space of tempered distributions. If the Lévy measure does not have a positive absolute moment of any order, then the event on which the Lévy white noise is a tempered distribution has probability zero.
Oracle inequalities and variable selection properties for the Lasso in linear models have been established under a variety of different assumptions on the design matrix. We show in this paper how the different conditions and concepts relate to each other. The restricted eigenvalue condition [2] or the slightly weaker compatibility condition [18] is sufficient for oracle results. We argue that both these conditions allow for a fairly general class of design matrices. Hence, optimality of the Lasso for prediction and estimation holds for more general situations than appears from coherence [5, 4] or restricted isometry [10] assumptions.
Suppose we are given a vector $f$ in a class ${\cal F} \subset \mathbb{R}^N$, e.g., a class of digital signals or digital images. How many linear measurements do we need to make about $f$ to be able to recover $f$ to within precision $\epsilon$ in the Euclidean ($\ell_2$) metric? This paper shows that if the objects of interest are sparse in a fixed basis or compressible, then it is possible to reconstruct $f$ to within very high accuracy from a small number of random measurements by solving a simple linear program. More precisely, suppose that the $n$th largest entry of the vector $|f|$ (or of its coefficients in a fixed basis) obeys $|f|_{(n)} \le R \cdot n^{-1/p}$, where $R > 0$ and $p > 0$. Suppose that we take measurements $y_k = \langle f, X_k\rangle$, $k = 1, \ldots, K$, where the $X_k$ are $N$-dimensional Gaussian vectors with independent standard normal entries. Then for each $f$ obeying the decay estimate above for some $0 < p < 1$ and with overwhelming probability, our reconstruction $f^\sharp$, defined as the solution to the constraints $y_k = \langle f^\sharp, X_k \rangle$ with minimal $\ell_1$ norm, obeys $$\Vert f - f^\sharp\Vert_{\ell_2} \le C_p \cdot R \cdot (K/\log N)^{-r}, \quad r = 1/p - 1/2.$$ There is a sense in which this result is optimal; it is generally impossible to obtain a higher accuracy from any set of $K$ measurements whatsoever. The methodology extends to various other random measurement ensembles; for example, we show that similar results hold if one observes a few randomly sampled Fourier coefficients of $f$. In fact, the results are quite general and require only two hypotheses on the measurement ensemble which are detailed.
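A small numerical sketch of the recovery program described above (minimal $\ell_1$ norm subject to the measurement constraints), using cvxpy as a generic convex solver; the library choice, dimensions and sparsity level are illustrative assumptions only.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
N, K, k = 200, 60, 5                       # ambient dimension, measurements, sparsity
f = np.zeros(N)
f[rng.choice(N, size=k, replace=False)] = rng.normal(size=k)

X = rng.normal(size=(K, N))                # Gaussian measurement vectors as rows
y = X @ f                                  # y_k = <f, X_k>

# Reconstruction: minimize the l1 norm subject to matching the measurements.
f_sharp = cp.Variable(N)
problem = cp.Problem(cp.Minimize(cp.norm(f_sharp, 1)), [X @ f_sharp == y])
problem.solve()
print("recovery error:", np.linalg.norm(f - f_sharp.value))
```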
Sparsity or parsimony of statistical models is crucial for their proper interpretations, as in sciences and social sciences. Model selection is a commonly used method to find such models, but usually involves a computationally heavy combinatorial search. Lasso (Tibshirani, 1996) is now being used as a computationally feasible alternative to model selection. Therefore it is important to study Lasso for model selection purposes. In this paper, we prove that a single condition, which we call the Irrepresentable Condition, is almost necessary and sufficient for Lasso to select the true model both in the classical fixed p setting and in the large p setting as the sample size n gets large. Based on these results, sufficient conditions that are verifiable in practice are given to relate to previous works and help applications of Lasso for feature selection and sparse representation. This Irrepresentable Condition, which depends mainly on the covariance of the predictor variables, states that Lasso selects the true model consistently if and (almost) only if the predictors that are not in the true model are irrepresentable (in a sense to be clarified) by predictors that are in the true model. Furthermore, simulations are carried out to provide insights and understanding of this result.
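For concreteness, the strong form of the condition is commonly written as follows (with $S$ the support of the true coefficient vector $\beta$ and $X_S$, $X_{S^c}$ the corresponding column blocks of the design; see the paper for the exact constants and the weak/strong variants):

```latex
\big\| X_{S^c}^{\top} X_S \big( X_S^{\top} X_S \big)^{-1} \operatorname{sign}(\beta_S) \big\|_{\infty}
\;\le\; 1 - \eta \quad \text{for some } \eta > 0 .
```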
The authors consider classes of signals that have a finite number of degrees of freedom per unit of time and call this number the rate of innovation. Examples of signals with a finite rate of innovation include streams of Diracs (e.g., the Poisson process), nonuniform splines, and piecewise polynomials. Even though these signals are not bandlimited, we show that they can be sampled uniformly at (or above) the rate of innovation using an appropriate kernel and then be perfectly reconstructed. Thus, we prove sampling theorems for classes of signals and kernels that generalize the classic "bandlimited and sinc kernel" case. In particular, we show how to sample and reconstruct periodic and finite-length streams of Diracs, nonuniform splines, and piecewise polynomials using sinc and Gaussian kernels. For infinite-length signals with finite local rate of innovation, we show local sampling and reconstruction based on spline kernels. The key in all constructions is to identify the innovative part of a signal (e.g., time instants and weights of Diracs) using an annihilating or locator filter: a device well known in spectral analysis and error-correction coding. This leads to standard computational procedures for solving the sampling problem, which we show through experimental results. Applications of these new sampling results can be found in signal processing, communications systems, and biological systems.
High-dimensional statistical inference deals with models in which the number of parameters p is comparable to or larger than the sample size n. Since it is usually impossible to obtain consistent procedures unless $p/n\rightarrow0$, a line of recent work has studied models with various types of low-dimensional structure, including sparse vectors, sparse and structured matrices, low-rank matrices and combinations thereof. In such settings, a general approach to estimation is to solve a regularized optimization problem, which combines a loss function measuring how well the model fits the data with some regularization function that encourages the assumed structure. This paper provides a unified framework for establishing consistency and convergence rates for such regularized M-estimators under high-dimensional scaling. We state one main theorem and show how it can be used to re-derive some existing results, and also to obtain a number of new results on consistency and convergence rates, in both $\ell_2$-error and related norms. Our analysis also identifies two key properties of loss and regularization functions, referred to as restricted strong convexity and decomposability, that ensure corresponding regularized M-estimators have fast convergence rates and which are optimal in many well-studied cases.
The goal of this paper is twofold. In the first part we will study Lévy white noise in different distributional spaces and solve equations of the type $p(D)s=q(D)\dot{L}$, where $p$ and $q$ are polynomials. Furthermore, we will study measurability of $s$ in Besov spaces. By using this result we will prove that stochastic partial differential equations of the form \begin{align*} p(D)u=g(\cdot,u)+\dot{L} \end{align*} have measurable solutions in weighted Besov spaces, where $p(D)$ is a partial differential operator in a certain class, $g:\mathbb{R}^d\times \mathbb{C}\to \mathbb{R}$ satisfies some Lipschitz condition and $\dot{L}$ is a Lévy white noise.
Layered stable (multivariate) distributions and processes are defined and studied. A layered stable process combines stable trends of two different indices, one of them possibly Gaussian. More precisely, over short intervals it is close to a stable process, while over long intervals it approximates another stable (possibly Gaussian) process. The absolute continuity of a layered stable process with respect to its short-range limiting stable process is also investigated. A series representation of layered stable processes is derived, giving insights into the structure both of the sample paths and of the short- and long-range behaviours of the process. This series representation is further used for simulation of sample paths.
Complex networks describe a wide range of systems in nature and society, much quoted examples including the cell, a network of chemicals linked by chemical reactions, or the Internet, a network of routers and computers connected by physical links. While traditionally these systems were modeled as random graphs, it is increasingly recognized that the topology and evolution of real networks is governed by robust organizing principles. Here we review the recent advances in the field of complex networks, focusing on the statistical mechanics of network topology and dynamics. After reviewing the empirical data that motivated the recent interest in networks, we discuss the main models and analytical tools, covering random graphs, small-world and scale-free networks, as well as the interplay between topology and the network's robustness against failures and attacks.
Fitting high-dimensional statistical models often requires the use of non-linear parameter estimation procedures. As a consequence, it is generally impossible to obtain an exact characterization of the probability distribution of the parameter estimates. This in turn implies that it is extremely challenging to quantify the uncertainty associated with a certain parameter estimate. Concretely, no commonly accepted procedure exists for computing classical measures of uncertainty and statistical significance, such as confidence intervals or p-values, for these models. We consider here the high-dimensional linear regression problem, and propose an efficient algorithm for constructing confidence intervals and p-values. The resulting confidence intervals have nearly optimal size. When testing for the null hypothesis that a certain parameter is vanishing, our method has nearly optimal power. Our approach is based on constructing a 'de-biased' version of regularized M-estimators. The new construction improves over recent work in the field in that it does not assume a special structure on the design matrix. We test our method on synthetic data and a high-throughput genomic data set about riboflavin production rate, made publicly available by Buhlmann et al. (2014).
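The de-biasing construction referred to above is commonly written in the following standard form, with $M$ an approximate inverse of the sample covariance $\widehat{\Sigma} = X^{\top}X/n$; the paper's particular choice of $M$ is what removes the assumption of special design structure.

```latex
\widehat{\theta}^{\,\mathrm{d}}
\;=\; \widehat{\theta}^{\,\mathrm{lasso}}
\;+\; \frac{1}{n}\, M X^{\top}\big( y - X \widehat{\theta}^{\,\mathrm{lasso}} \big).
```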
We introduce a definition of the notion of compressibility for infinite deterministic and i.i.d. random sequences which is based on the asymptotic behavior of truncated subsequences. For this purpose, we use asymptotic results regarding the distribution of order statistics for heavy-tail distributions and their link with α-stable laws for 1 < α < 2. In many cases, our proposed definition of compressibility coincides with intuition. In particular, we prove that heavy-tail (polynomial decaying) distributions fulfill the requirements of compressibility. On the other hand, exponential decaying distributions like Laplace and Gaussian do not. The results are such that two compressible distributions can be compared with each other in terms of their degree of compressibility.
This paper considers the model problem of reconstructing an object from incomplete frequency samples. Consider a discrete-time signal $f \in \mathbb{C}^N$ and a randomly chosen set of frequencies $\Omega$. Is it possible to reconstruct $f$ from the partial knowledge of its Fourier coefficients on the set $\Omega$? A typical result of this paper is as follows. Suppose that $f$ is a superposition of $|T|$ spikes, $f(t) = \sum_{\tau \in T} f(\tau)\,\delta(t-\tau)$, obeying $|T| \le C_M \cdot (\log N)^{-1} \cdot |\Omega|$ for some constant $C_M > 0$. We do not know the locations of the spikes nor their amplitudes. Then with probability at least $1 - O(N^{-M})$, $f$ can be reconstructed exactly as the solution to the $\ell_1$ minimization problem. In short, exact recovery may be obtained by solving a convex optimization problem. We give numerical values for $C_M$ which depend on the desired probability of success. Our result may be interpreted as a novel kind of nonlinear sampling theorem. In effect, it says that any signal made out of $|T|$ spikes may be recovered by convex programming from almost every set of frequencies of size $O(|T| \cdot \log N)$. Moreover, this is nearly optimal in the sense that any method succeeding with probability $1 - O(N^{-M})$ would in general require a number of frequency samples at least proportional to $|T| \cdot \log N$. The methodology extends to a variety of other situations and higher dimensions. For example, we show how one can reconstruct a piecewise constant (one- or two-dimensional) object from incomplete frequency samples - provided that the number of jumps (discontinuities) obeys the condition above - by minimizing other convex functionals such as the total variation of $f$.
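A toy reconstruction in the spirit of this result is sketched below. It uses a real-valued partial DCT as a stand-in for the random partial Fourier ensemble and solves the $\ell_1$ problem $\min \|x\|_1$ subject to $Ax = b$ as a linear program with scipy; the sizes and the DCT substitution are illustrative simplifications, not the paper's setting.

```python
# Toy sparse recovery from incomplete "frequency" samples: a real partial DCT
# stands in for the random partial Fourier ensemble, and the l1 problem
#   min ||x||_1  s.t.  A x = b
# is solved as a linear program.  All sizes are illustrative choices.
import numpy as np
from scipy.fft import dct
from scipy.optimize import linprog

rng = np.random.default_rng(0)
N, m, s = 128, 40, 5                           # signal length, samples, spikes

x_true = np.zeros(N)
x_true[rng.choice(N, s, replace=False)] = rng.standard_normal(s)

F = dct(np.eye(N), norm="ortho", axis=0)       # orthonormal DCT matrix
rows = rng.choice(N, m, replace=False)         # random set of "frequencies"
A, b = F[rows], F[rows] @ x_true

# min sum(u)  s.t.  -u <= x <= u,  A x = b     (variables z = [x, u])
c = np.concatenate([np.zeros(N), np.ones(N)])
I = np.eye(N)
A_ub = np.block([[I, -I], [-I, -I]])
b_ub = np.zeros(2 * N)
A_eq = np.hstack([A, np.zeros((m, N))])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b,
              bounds=[(None, None)] * N + [(0, None)] * N)
x_hat = res.x[:N]
print("max reconstruction error:", np.abs(x_hat - x_true).max())
```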
In this paper, we identify a class of absolutely continuous probability distributions and show that the differential entropy is uniformly convergent over this space under the metric of total variation distance. One of the advantages of this class is that its requirements can be readily verified for a given distribution.
The sparsity and compressibility of finite-dimensional signals are of great interest in fields such as compressed sensing. The notion of compressibility is also extended to infinite sequences of independent identically distributed or ergodic random variables based on the observed error in their nonlinear k-term approximation. In this paper, we use the entropy measure to study the compressibility of continuous-domain innovation processes (alternatively known as white noise). Specifically, we define such a measure as the entropy limit of the doubly quantized (time and amplitude) process. This provides a tool to compare the compressibility of various innovation processes. It also allows us to identify an analogue of the concept of "entropy dimension" which was originally defined by Rényi for random variables. Particular attention is given to stable and impulsive Poisson innovation processes. Here, our results recognize Poisson innovations as the more compressible ones, with an entropy measure far below that of stable innovations. While this result departs from previous knowledge regarding the compressibility of impulsive Poisson laws compared with continuous fat-tailed distributions, our entropy measure ranks α-stable innovations according to their tail.
Information spreads across social and technological networks, but often the network structures are hidden from us and we only observe the traces left by the diffusion processes, called cascades. Can we recover the hidden network structures from these observed cascades? What kind of cascades and how many cascades do we need? Are there some network structures which are more difficult than others to recover? Can we design efficient inference algorithms with provable guarantees? Despite the increasing availability of cascade data and methods for inferring networks from these data, a thorough theoretical treatment of the above questions is still largely missing in the literature. In this paper, we investigate the network structure inference problem for a general family of continuous-time diffusion models using an $l_1$-regularized likelihood maximization framework. We show that, as long as the cascade sampling process satisfies a natural incoherence condition, our framework can recover the correct network structure with high probability if we observe $O(d^3 \log N)$ cascades, where $d$ is the maximum number of parents of a node and $N$ is the total number of nodes. Moreover, we develop a simple and efficient soft-thresholding inference algorithm, which we use to illustrate the consequences of our theoretical results, and show that our framework outperforms other alternatives in practice.
We prove novel convergence results for a stochastic proximal gradient algorithm suitable for solving a large class of convex optimization problems, where a convex objective function is given by the sum of a smooth and a possibly non-smooth component. We study convergence of the iterates and derive $O(1/n)$ non-asymptotic bounds in expectation in the strongly convex case, as well as almost sure convergence results under weaker assumptions. Our approach allows us to avoid averaging and to weaken the boundedness assumptions that are often imposed in theoretical studies and might not be satisfied in practice.
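A minimal instance of the kind of iteration analysed here is sketched below: stochastic proximal gradient for $\ell_1$-regularized least squares, where the proximal operator is soft-thresholding and gradients are computed from single samples. The step-size rule and problem data are illustrative choices, not the paper's assumptions.

```python
# Stochastic proximal gradient for l1-regularised least squares,
#   min_w  E[(1/2)(a^T w - y)^2] + lam * ||w||_1,
# with single-sample gradients and prox = soft-thresholding.
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 5000, 20, 0.05
w_true = np.zeros(d); w_true[:3] = 1.0
A = rng.standard_normal((n, d))
y = A @ w_true + 0.1 * rng.standard_normal(n)

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

w = np.zeros(d)
for k in range(1, n + 1):
    i = rng.integers(n)
    grad = (A[i] @ w - y[i]) * A[i]            # stochastic gradient of the smooth part
    step = 1.0 / (0.1 * k + 10.0)              # diminishing step size (illustrative)
    w = soft_threshold(w - step * grad, step * lam)   # proximal step

print("estimation error:", np.linalg.norm(w - w_true))
```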
Discover New Methods for Dealing with High-Dimensional Data. A sparse statistical model has only a small number of nonzero parameters or weights; therefore, it is much easier to estimate and interpret than a dense model. Statistical Learning with Sparsity: The Lasso and Generalizations presents methods that exploit sparsity to help recover the underlying signal in a set of data. Top experts in this rapidly evolving field, the authors describe the lasso for linear regression and a simple coordinate descent algorithm for its computation. They discuss the application of $\ell_1$ penalties to generalized linear models and support vector machines, cover generalized penalties such as the elastic net and group lasso, and review numerical methods for optimization. They also present statistical inference methods for fitted (lasso) models, including the bootstrap, Bayesian methods, and recently developed approaches. In addition, the book examines matrix decomposition, sparse multivariate analysis, graphical models, and compressed sensing. It concludes with a survey of theoretical results for the lasso. In this age of big data, the number of features measured on a person or object can be large and might be larger than the number of observations. This book shows how the sparsity assumption allows us to tackle these problems and extract useful and reproducible patterns from big datasets. Data analysts, computer scientists, and theorists will appreciate this thorough and up-to-date treatment of sparse statistical modeling.
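In the spirit of the coordinate descent algorithm the book describes, the following is a bare-bones lasso solver in which each coordinate update reduces to a closed-form soft-thresholding step; it is a didactic sketch, not a substitute for glmnet or sklearn.

```python
# Bare-bones coordinate descent for the lasso
#   min_beta  (1/(2n)) ||y - X beta||^2 + lam * ||beta||_1,
# where each coordinate update is a closed-form soft-thresholding step.
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=100):
    n, p = X.shape
    beta = np.zeros(p)
    r = y - X @ beta                           # current residual
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(p):
            r += X[:, j] * beta[j]             # remove coordinate j from the fit
            z = X[:, j] @ r
            beta[j] = soft_threshold(z, n * lam) / col_sq[j]
            r -= X[:, j] * beta[j]             # put the updated coordinate back
    return beta

rng = np.random.default_rng(0)
n, p = 100, 200
beta_true = np.zeros(p); beta_true[:4] = 1.0
X = rng.standard_normal((n, p))
y = X @ beta_true + 0.1 * rng.standard_normal(n)
print(np.nonzero(lasso_cd(X, y, lam=0.1))[0])  # should be (mostly) the first 4 indices
```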
In this paper we consider optimization problems where the objective function is given in a form of the expectation. A basic difficulty of solving such stochastic optimization problems is that the involved multidimensional integrals (expectations) cannot be computed with high accuracy. The aim of this paper is to compare two computational approaches based on Monte Carlo sampling techniques, namely, the stochastic approximation (SA) and the sample average approximation (SAA) methods. Both approaches, the SA and SAA methods, have a long history. Current opinion is that the SAA method can efficiently use a specific (say, linear) structure of the considered problem, while the SA approach is a crude subgradient method, which often performs poorly in practice. We intend to demonstrate that a properly modified SA approach can be competitive and even significantly outperform the SAA method for a certain class of convex stochastic problems. We extend the analysis to the case of convex-concave stochastic saddle point problems and present (in our opinion highly encouraging) results of numerical experiments.
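A toy version of the comparison is sketched below for the one-dimensional problem of minimizing $f(w) = E[(w - Z)^2/2]$ with $Z \sim N(2,1)$: SA is run as averaged stochastic gradient descent, while SAA minimizes the sample-average objective exactly (here, the sample mean). The step-size choice is illustrative, not the tuned robust-SA constants from the paper.

```python
# Toy SA vs SAA comparison on f(w) = E[(w - Z)^2 / 2], Z ~ N(2, 1),
# whose minimiser is E[Z] = 2.
import numpy as np

rng = np.random.default_rng(0)
theta, n = 2.0, 10_000
z = theta + rng.standard_normal(n)

# SA: one pass of stochastic gradient descent with averaging of the iterates
w, w_avg = 0.0, 0.0
for k, zk in enumerate(z, start=1):
    w -= (1.0 / np.sqrt(n)) * (w - zk)         # gradient of (w - zk)^2 / 2
    w_avg += (w - w_avg) / k                   # running average of the iterates

# SAA: minimise (1/n) sum_k (w - z_k)^2 / 2 exactly, i.e. the sample mean
w_saa = z.mean()

print("SA (averaged) error:", abs(w_avg - theta))
print("SAA           error:", abs(w_saa - theta))
```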
We introduce and systematically investigate Bessel potential spaces associated with a real-valued continuous negative definite function. These spaces can be regarded as (higher order) $L_p$-variants of translation invariant Dirichlet spaces and, in general, …
We exhibit an approximate equivalence between the Lasso estimator and Dantzig selector. For both methods we derive parallel oracle inequalities for the prediction risk in the general nonparametric regression model, as well as bounds on the $\ell_p$ estimation loss for $1\le p\le 2$ in the linear model when the number of variables can be much larger than the sample size.
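The sketch below gives a small numerical check of this approximate equivalence (not the paper's oracle inequalities): it fits the lasso with sklearn and the Dantzig selector as a linear program with scipy on the same data, using the same tuning level, and compares the two estimates.

```python
# Lasso vs Dantzig selector on the same data, with matched tuning levels.
import numpy as np
from scipy.optimize import linprog
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 50
beta_true = np.zeros(p); beta_true[:3] = 1.0
X = rng.standard_normal((n, p))
y = X @ beta_true + 0.1 * rng.standard_normal(n)
lam = 2 * 0.1 * np.sqrt(n * np.log(p))         # same level for both methods

# sklearn minimises (1/(2n))||y - Xb||^2 + alpha*||b||_1; with alpha = lam/n its
# optimality condition reads ||X^T(y - Xb)||_inf <= lam, matching the Dantzig level.
beta_lasso = Lasso(alpha=lam / n, fit_intercept=False).fit(X, y).coef_

# Dantzig selector: min ||b||_1  s.t.  ||X^T (y - X b)||_inf <= lam, as an LP
G, c0 = X.T @ X, X.T @ y
I = np.eye(p)
c = np.concatenate([np.zeros(p), np.ones(p)])
A_ub = np.block([[I, -I], [-I, -I], [G, np.zeros((p, p))], [-G, np.zeros((p, p))]])
b_ub = np.concatenate([np.zeros(2 * p), lam + c0, lam - c0])
res = linprog(c, A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * p + [(0, None)] * p)
beta_dantzig = res.x[:p]

print("max |lasso - dantzig|:", np.abs(beta_lasso - beta_dantzig).max())
```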
A constructive theory of randomness for functions, based on computational complexity, is developed, and a pseudorandom function generator is presented. This generator is a deterministic polynomial-time algorithm that transforms pairs $(g, r)$, where $g$ is any one-way function and $r$ is a random $k$-bit string, to polynomial-time computable functions $f_r : \{1, \ldots, 2^k\} \to \{1, \ldots, 2^k\}$. These $f_r$'s cannot be distinguished from random functions by any probabilistic polynomial-time algorithm that asks for and receives the value of a function at arguments of its choice. The result has applications in cryptography, random constructions, and complexity theory.
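The following sketch shows the tree construction described here, with the length-doubling generator instantiated, purely heuristically, by SHA-256 rather than by a proven one-way-function-based generator; the key length and input length are illustrative.

```python
# Sketch of the tree construction above.  The length-doubling generator G is
# instantiated heuristically with SHA-256 (a stand-in, not the paper's
# one-way-function-based generator): G(seed) returns two k-bit halves, and
# f_r walks the bits of its input, keeping the left or right half each time.
import hashlib

K = 16  # key/seed length in bytes (k = 128 bits)

def G(seed: bytes) -> tuple[bytes, bytes]:
    """Length-doubling generator: one k-bit seed -> two k-bit outputs."""
    left = hashlib.sha256(b"L" + seed).digest()[:K]
    right = hashlib.sha256(b"R" + seed).digest()[:K]
    return left, right

def f(r: bytes, x: int, n_bits: int = 32) -> bytes:
    """The pseudorandom function f_r evaluated at an n_bits-bit input x."""
    seed = r
    for i in reversed(range(n_bits)):
        left, right = G(seed)
        seed = right if (x >> i) & 1 else left
    return seed

r = b"\x00" * K                                 # the random key r
print(f(r, 1).hex(), f(r, 2).hex())             # random-looking, reproducible outputs
```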
This paper involves certain types of spatial-temporal models constructed from Lévy bases. The dynamics is described by a field of stochastic processes, indexed by a set of sites, and defined as integrals of deterministic kernel functions of a special form with respect to a Lévy basis. The first topic is OU (Ornstein-Uhlenbeck) fields, which represent certain extensions of the concept of OU processes (processes of Ornstein-Uhlenbeck type); the focus here is mainly on their potential for dynamic modelling. Applications to dynamical spatial processes of Cox type are briefly indicated. The second part of the paper discusses the modelling of spatial-temporal correlations of SI (stochastic intermittency) fields; their form is useful when explicitly computing the expectations that are used to characterize correlations. The SI fields can be viewed as a dynamical, continuous, and homogeneous generalization of turbulent cascades. In this connection an SI field is constructed with spatial-temporal scaling behaviour that agrees with the energy dissipation observed in turbulent flows. Some parallels of this construction are also briefly sketched.
In recent years, social networking platforms have developed into extraordinary channels for spreading and consuming information. Along with the rise of such infrastructure, there is continuous progress on techniques for spreading information effectively through influential users. In many applications, one is restricted to select influencers from a set of users who engaged with the topic being promoted, and due to the structure of social networks, these users often rank low in terms of their influence potential. An alternative approach one can consider is an adaptive method which selects users in a manner which targets their influential neighbors. The advantage of such an approach is that it leverages the friendship paradox in social networks: while users are often not influential, they often know someone who is. Despite the various complexities in such optimization problems, we show that scalable adaptive seeding is achievable. In particular, we develop algorithms for linear influence models with provable approximation guarantees that can be gracefully parallelized. To show the effectiveness of our methods, we collected data from various verticals that social network users follow. For each vertical, we collected data on the users who responded to a certain post as well as their neighbors, and applied our methods on this data. Our experiments show that adaptive seeding is scalable, and importantly, that it obtains dramatic improvements over standard approaches of information dissemination.
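The friendship paradox that this approach leverages is easy to illustrate on a synthetic scale-free graph: a random neighbor of a random user has a much larger expected degree than a random user. The sketch below does exactly that with networkx; the graph model and sample sizes are illustrative, not the verticals data used in the paper.

```python
# Friendship-paradox illustration on a synthetic scale-free graph:
# random neighbours of random users have much higher average degree.
import random
import networkx as nx

random.seed(0)
G = nx.barabasi_albert_graph(n=10_000, m=3, seed=0)

node_list = list(G.nodes())
nodes = random.choices(node_list, k=5_000)                       # random users
neighbours = [random.choice(list(G.neighbors(v))) for v in nodes]  # one random friend each

def avg_degree(vs):
    return sum(G.degree(v) for v in vs) / len(vs)

print("average degree of random users     :", avg_degree(nodes))
print("average degree of their neighbours :", avg_degree(neighbours))
```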
Random matrices are widely used in sparse recovery problems, and the relevant properties of matrices with i.i.d. entries are well understood. This paper discusses the recently introduced restricted eigenvalue (RE) condition, which is among the most general assumptions on the matrix that guarantee recovery. We prove a reduction principle showing that the RE condition can be guaranteed by checking the restricted isometry on a certain family of low-dimensional subspaces. This principle allows us to establish the RE condition for several broad classes of random matrices with dependent entries, including random matrices with sub-Gaussian rows and nontrivial covariance structure, as well as matrices with independent rows and uniformly bounded entries.
A theorem exhibiting the duality between certain infinite systems of interacting stochastic processes and a type of branching process is proved. This duality is then used to study the ergodic properties of the infinite system. In the case of the vector model a complete understanding of the ergodic behavior is obtained.
A strengthened central limit theorem for densities is established showing monotone convergence in the sense of relative entropy.
For the problem of minimizing a lower semicontinuous proper convex function $f$ on a Hilbert space, the proximal point algorithm in exact form generates a sequence $\{z^k\}$ by taking $z^{k+1}$ to be the minimizer of $f(z) + (1/(2c_k))\|z - z^k\|^2$, where $c_k > 0$. This algorithm is of interest for several reasons, but especially because of its role in certain computational methods based on duality, such as the Hestenes-Powell method of multipliers in nonlinear programming. It is investigated here in a more general form where the requirement for exact minimization at each iteration is weakened, and the subdifferential $\partial f$ is replaced by an arbitrary maximal monotone operator $T$. Convergence is established under several criteria amenable to implementation. The rate of convergence is shown to be "typically" linear with an arbitrarily good modulus if $c_k$ stays large enough, in fact superlinear if $c_k \to \infty$. The case of $T = \partial f$ is treated in extra detail. Application is also made to a related case corresponding to minimax problems.
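A direct numerical transcription of the exact form of the algorithm is sketched below for a simple convex $f$, with each inner minimization of $f(z) + (1/(2c_k))\|z - z^k\|^2$ handed to scipy; the test function and the constant choice of $c_k$ are illustrative.

```python
# Exact proximal point iteration  z_{k+1} = argmin_z f(z) + (1/(2 c_k))||z - z_k||^2
# for a simple convex test function, with each inner problem solved by scipy.
import numpy as np
from scipy.optimize import minimize

# Convex test function; its minimiser is 2 in every coordinate.
f = lambda z: np.sum(np.abs(z)) + 0.5 * np.sum((z - 3.0) ** 2)

z = np.full(5, 10.0)                        # starting point z^0
c = 1.0                                     # constant c_k > 0 (illustrative)
for k in range(50):
    obj = lambda w, zk=z: f(w) + np.sum((w - zk) ** 2) / (2 * c)
    z = minimize(obj, z).x                  # exact (numerical) proximal step

print("approximate minimiser of f:", np.round(z, 3))   # close to [2, 2, 2, 2, 2]
```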
For real Lévy processes $(X_t)_{t \geq 0}$ having no Brownian component with Blumenthal-Getoor index $\beta$, the estimate $E \sup_{s \leq t} |X_s - a_p s|^p \leq C_p t$ for every $t \in [0,1]$ and suitable $a_p \in \mathbb{R}$ has been established by Millar for $\beta < p \leq 2$ provided $X_1 \in L^p$. We derive extensions of these estimates to the cases $p < 2$ and $p \leq \beta$.
Part 1, Basic properties: definition and elementary properties; continuity theorems and inversion formulas; criteria; inequalities; characteristic functions and moments; expansion of characteristic functions; asymptotic behaviour; unimodality; analyticity of characteristic functions; multivariate characteristic functions. Part 2, Inequalities: auxiliary results; inequalities for characteristic functions of distributions with bounded support; moment inequalities; estimates for the characteristic functions of unimodal distributions; estimates for the characteristic functions of absolutely continuous distributions; estimates for the characteristic functions of discrete distributions; inequalities for multivariate characteristic functions; inequalities involving integrals of characteristic functions; inequalities involving differences of characteristic functions; estimates for the positive zero of a characteristic function. Part 3, Empirical characteristic functions: definition and basic properties; asymptotic properties of empirical characteristic functions; the first positive zero; parameter estimation; non-parametric density estimation 1; non-parametric density estimation 2; tests for independence; tests for symmetry; testing for normality; goodness-of-fit tests based on empirical characteristic functions.
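The empirical characteristic function that Part 3 is built on is straightforward to compute; the sketch below evaluates $\varphi_n(t) = \frac{1}{n}\sum_j e^{itX_j}$ for a Gaussian sample and compares it with the standard normal characteristic function $e^{-t^2/2}$. Sample size and the grid of $t$ values are arbitrary illustrative choices.

```python
# Empirical characteristic function phi_n(t) = (1/n) sum_j exp(i t X_j),
# compared with the standard normal characteristic function exp(-t^2/2).
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(5_000)
t = np.linspace(-3, 3, 13)

ecf = np.exp(1j * np.outer(t, x)).mean(axis=1)   # empirical characteristic function
cf = np.exp(-t ** 2 / 2)                         # standard normal characteristic function
print("max |ecf - cf| on the grid:", np.abs(ecf - cf).max())
```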
We extend the standard scale-free network model to include a "triad formation step." We analyze the geometric properties of networks generated by this algorithm both analytically and by numerical calculations, and find that our model possesses the same characteristics as the standard scale-free networks, such as the power-law degree distribution and the small average geodesic length, but with high clustering at the same time. In our model, the clustering coefficient is also shown to be tunable simply by changing a control parameter---the average number of triad formation trials per time step.
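A model of this type (growth by preferential attachment plus a triad-formation step with probability $p$) is available in networkx as powerlaw_cluster_graph, so the tunability of the clustering coefficient is easy to see; the parameter values below are illustrative.

```python
# Growth with preferential attachment plus a triad-formation step, via
# networkx's powerlaw_cluster_graph; p is the triad-formation probability.
import networkx as nx

for p in (0.0, 0.3, 0.9):
    G = nx.powerlaw_cluster_graph(n=5_000, m=3, p=p, seed=0)
    print(f"p = {p:.1f}  average clustering = {nx.average_clustering(G):.3f}")
# p = 0 essentially reduces to the standard scale-free model with low
# clustering; increasing p raises the clustering coefficient while the
# degree distribution stays heavy-tailed.
```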
We develop a principled way of identifying probability distributions whose independent and identically distributed (iid) realizations are compressible, i.e., can be well-approximated as sparse. We focus on Gaussian random underdetermined linear regression (GULR) problems, where compressibility is known to ensure the success of estimators exploiting sparse regularization. We prove that many distributions revolving around the maximum a posteriori (MAP) interpretation of sparse regularized estimators are in fact incompressible, in the limit of large problem sizes. A highlight is the Laplace distribution and $\ell^{1}$ regularized estimators such as the Lasso and Basis Pursuit denoising. To establish this result, we identify non-trivial undersampling regions in GULR where the simple least squares solution almost surely outperforms an oracle sparse solution, when the data is generated from the Laplace distribution. We provide simple rules of thumb to characterize classes of compressible (respectively incompressible) distributions based on their second and fourth moments. Generalized Gaussians and generalized Pareto distributions serve as running examples for concreteness.