We study two related problems motivated by molecular biology. • Given a graph G and a constant k, does there exist a supergraph $G'$ of G that is a unit interval graph and has clique size at most k? • Given a graph G and a proper k-coloring c of G, does there exist a supergraph $G'$ of G that is properly colored by c and is a unit interval graph? We show that both problems are polynomial for fixed k. On the other hand, we prove that the first problem is equivalent to deciding whether the bandwidth of G is at most $k - 1$. Hence, it is NP-hard and $W[t]$-hard for all t. We also show that the second problem is $W[1]$-hard. This implies that for fixed k, both problems are unlikely to have an $O(n^\alpha)$ algorithm, where $\alpha$ is a constant independent of k. A central tool in our study is a new graph-theoretic parameter closely related to pathwidth. An unexpectedly useful consequence is the equivalence of this parameter to the bandwidth of the graph.
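To make the bandwidth parameter concrete (the abstract shows that the first completion problem is equivalent to deciding whether the bandwidth of G is at most k - 1), here is a minimal brute-force Python sketch of that decision; it enumerates all vertex orderings and is purely illustrative, not the paper's algorithm.

```python
from itertools import permutations

def bandwidth_at_most(vertices, edges, k):
    """Brute-force check whether the graph (vertices, edges) has bandwidth <= k.

    The bandwidth is the minimum, over all linear orderings of the vertices,
    of the maximum |position(u) - position(v)| over edges {u, v}.
    Exponential in the number of vertices; illustration only.
    """
    for order in permutations(vertices):
        pos = {v: i for i, v in enumerate(order)}
        if all(abs(pos[u] - pos[v]) <= k for u, v in edges):
            return True
    return False

# Example: a path on 4 vertices has bandwidth 1.
print(bandwidth_at_most([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4)], 1))  # True
```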
The Fréchet distance is a measure of similarity between two curves $f$ and $g$ that takes into account the ordering of the points along the two curves: Informally, it is the minimum length of a leash required to connect a dog, walking along $f$, and its owner, walking along $g$, as they walk without backtracking along their respective curves from one endpoint to the other. The discrete Fréchet distance replaces the dog and its owner by a pair of frogs that can only reside on $m$ and $n$ specific stones, respectively. The stones are in fact sequences of points, typically sampled from the respective curves $f$ and $g$. These frogs hop from one stone to the next without backtracking, and the discrete Fréchet distance is the minimum length of a "leash" that connects the frogs and allows them to execute such a sequence of hops from the starting points to the terminal points of their sequences. The discrete Fréchet distance can be computed in $O(mn)$ time by a straightforward dynamic programming algorithm. We present the first subquadratic algorithm for computing the discrete Fréchet distance between two sequences of points in the plane. Assuming $m\le n$, the algorithm runs in $O(\frac{mn\log\log n}{\log n})$ time, in the word RAM model, using $O(n)$ storage. Our approach uses the geometry of the problem in a subtle way to encode legal positions of the frogs as states of a finite automaton.
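For reference, a minimal Python sketch of the straightforward O(mn) dynamic program mentioned in the abstract (the quadratic baseline, not the paper's subquadratic algorithm):

```python
from math import dist

def discrete_frechet(P, Q):
    """Straightforward O(mn) dynamic program for the discrete Frechet distance
    between point sequences P and Q (the baseline the paper improves on)."""
    m, n = len(P), len(Q)
    INF = float("inf")
    D = [[INF] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            d = dist(P[i], Q[j])
            if i == 0 and j == 0:
                best = 0.0
            else:
                best = INF
                if i > 0:
                    best = min(best, D[i - 1][j])      # frog on P hops
                if j > 0:
                    best = min(best, D[i][j - 1])      # frog on Q hops
                if i > 0 and j > 0:
                    best = min(best, D[i - 1][j - 1])  # both frogs hop
            D[i][j] = max(d, best)
    return D[m - 1][n - 1]

print(discrete_frechet([(0, 0), (1, 0), (2, 0)], [(0, 1), (1, 1), (2, 1)]))  # 1.0
```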
We show that the number of unit distances determined by n points in $\mathbb{R}^3$ is $O(n^{3/2})$, slightly improving the bound of Clarkson, Edelsbrunner, Guibas, Sharir and Welzl [5], established in 1990. The new proof uses the recently introduced polynomial partitioning technique of Guth and Katz [12]. While this paper was still in a draft stage, a similar proof of our main result was posted to the arXiv by Joshua Zahl [28].
Purely functional representations of catenable sorted lists. Haim Kaplan and Robert E. Tarjan (Department of Computer Science, Princeton University, Princeton, NJ). In STOC '96: Proceedings of the Twenty-Eighth Annual ACM Symposium on Theory of Computing, July 1996, pages 202–211. https://doi.org/10.1145/237814.237865
Many datasets, including market basket data, text or hypertext documents, and events recorded in different locations or time periods, can be modeled as a collection of sets over a ground set of keys. Common queries over such data, including similarity or association rules, are represented as the weight or selectivity of keys that satisfy some selection predicate defined over keys' attributes and memberships in particular sets.
Many data sources are naturally modeled by multiple weight assignments over a set of keys: snapshots of an evolving database at multiple points in time, measurements collected over multiple time periods, requests for resources served at multiple locations, and records with multiple numeric attributes. Over such vector-weighted data we are interested in aggregates with respect to one set of weights, such as weighted sums, and in aggregates over multiple sets of weights, such as the $L_1$ difference. Sample-based summarization is highly effective for data sets that are too large to be stored or manipulated. The summary facilitates approximate processing of queries that may be specified after the summary was generated. Current designs, however, are geared for data sets where a single scalar weight is associated with each key. We develop a sampling framework based on coordinated weighted samples that is suited for multiple weight assignments, and obtain estimators that are orders of magnitude tighter than previously possible. We demonstrate the power of our methods through an extensive empirical evaluation on diverse data sets, ranging from IP network data to stock quotes.
Computing the Discrete Fréchet Distance in Subquadratic Time. Pankaj K. Agarwal, Rinat Ben Avraham, Haim Kaplan, and Micha Sharir. In Proceedings of the 2013 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 156–167. https://doi.org/10.1137/1.9781611973105.12
The Fréchet distance is a similarity measure between two curves A and B that takes into account the location and ordering of the points along the two curves: Informally, it is the minimum length of a leash required to connect a dog, walking along A, and its owner, walking along B, as they walk without backtracking along their respective curves from one endpoint to the other. The discrete Fréchet distance replaces the dog and its owner by a pair of frogs that can only reside on n and m specific stones on the curves A and B, respectively. These frogs hop from one stone to the next without backtracking, and the discrete Fréchet distance is the minimum length of a "leash" that connects the frogs and allows them to execute such a sequence of hops. It can be computed in quadratic time by a straightforward dynamic programming algorithm. We present the first subquadratic algorithm for computing the discrete Fréchet distance between two sequences of points in the plane. Assuming m ≤ n, the algorithm runs in $O(\frac{mn\log\log n}{\log n})$ time, in the standard RAM model, using O(n) storage. Our approach uses the geometry of the problem in a subtle way to encode legal positions of the frogs as states of a finite automaton.
The best known upper bound on the number of topological changes in the Delaunay triangulation of a set of moving points in $\mathbb{R}^2$ is (nearly) cubic, even if each point is moving with a fixed velocity. We introduce the notion of a stable Delaunay graph (SDG in short), a dynamic subgraph of the Delaunay triangulation, that is less volatile in the sense that it undergoes fewer topological changes and yet retains many useful properties of the full Delaunay triangulation. SDG is defined in terms of a parameter $\alpha > 0$, and consists of Delaunay edges pq for which the (equal) angles at which p and q see the corresponding Voronoi edge $e_{pq}$ are at least $\alpha$. We prove several interesting properties of SDG and describe two kinetic data structures for maintaining it. Both structures use $O^*(n)$ storage. They process $O^*(n^2)$ events during the motion, each in $O^*(1)$ time, provided that the points of P move along algebraic trajectories of bounded degree; the $O^*(\cdot)$ notation hides multiplicative factors that are polynomial in $1/\alpha$ and polylogarithmic in n. The first structure is simpler, but the dependency on $1/\alpha$ in its performance is higher.
Competitive Analysis with a Sample and the Secretary Problem. Haim Kaplan, David Naori, and Danny Raz. In Proceedings of the 2020 ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 2082–2095. https://doi.org/10.1137/1.9781611975994.128
We extend the standard online worst-case model to accommodate past experience which is available to the online player in many practical scenarios. We do this by revealing a random sample of the adversarial input to the online player ahead of time. The online player competes with the expected optimal value on the part of the input that arrives online. Our model bridges between existing online stochastic models (e.g., items are drawn i.i.d. from a distribution) and the online worst-case model. We also extend in a similar manner (by revealing a sample) the online random-order model. We study the classical secretary problem in our new models. In the worst-case model we present a simple online algorithm with an optimal competitive ratio for any sample size. In the random-order model, we also give a simple online algorithm with an almost tight competitive ratio for small sample sizes. Interestingly, we prove that for a large enough sample, no algorithm can be simultaneously optimal in both the worst-case and random-order models.
The Fibonacci heap was devised to provide an especially efficient implementation of Dijkstra's shortest path algorithm. Although asymptotically efficient, it is not as fast in practice as other heap implementations. Expanding on ideas of Høyer [1995], we describe three heap implementations (two versions of thin heaps and one of thick heaps) that have the same amortized efficiency as Fibonacci heaps, but need less space and promise better practical performance. As part of our development, we fill in a gap in Høyer's analysis.
We describe a way of assigning labels to the vertices of any undirected graph on up to n vertices, each composed of n/2+O(1) bits, such that given the labels of two vertices, and no other information regarding the graph, it is possible to decide whether or not the vertices are adjacent in the graph. This is optimal, up to an additive constant, and constitutes the first improvement in almost 50 years of an n/2+O(log n) bound of Moon. As a consequence, we obtain an induced-universal graph for n-vertex graphs containing only O(2^{n/2}) vertices, which is optimal up to a multiplicative constant, solving an open problem of Vizing from 1968. We obtain similar tight results for directed graphs, tournaments and bipartite graphs.
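To make the notion of an adjacency labeling scheme concrete, here is a naive Python sketch in which vertex i stores its index plus one adjacency bit per smaller-indexed vertex. Its labels use up to n + O(log n) bits, far from the paper's n/2 + O(1) construction; it only illustrates the requirement that adjacency be decodable from two labels alone.

```python
def make_labels(adj):
    # Naive labeling: vertex i stores (i, adjacency bits to all j < i).
    # Labels here take up to n + O(log n) bits; the paper achieves n/2 + O(1).
    return [(i, [adj[i][j] for j in range(i)]) for i in range(len(adj))]

def adjacent(label_u, label_v):
    # Decide adjacency of two distinct vertices from their labels alone.
    (i, bits_i), (j, bits_j) = label_u, label_v
    return bool(bits_i[j]) if i > j else bool(bits_j[i])

adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]          # the path 0 - 1 - 2
labels = make_labels(adj)
print(adjacent(labels[0], labels[2]))  # False
print(adjacent(labels[1], labels[2]))  # True
```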
We present an efficient construction of additively weighted Voronoi diagrams on planar graphs. Let G be a planar graph with n vertices and b sites that lie on a constant number of faces. We show how to preprocess G in Õ(nb^2) time so that one can compute any additively weighted Voronoi diagram for these sites in Õ(b) time. We use this construction to compute the diameter of a directed planar graph with real arc lengths in Õ(n^{5/3}) time. This improves the recent breakthrough result of Cabello (SODA'17), both by improving the running time (from Õ(n^{11/6})), and by providing a deterministic algorithm. It is in fact the first truly subquadratic deterministic algorithm for this problem. Our use of Voronoi diagrams to compute the diameter follows that of Cabello, but he used abstract Voronoi diagrams, which makes his diameter algorithm more involved, more expensive, and randomized. As in Cabello's work, our algorithm can also compute the Wiener index of a planar graph (i.e., the sum of all pairwise distances) within the same bound. Our construction of Voronoi diagrams for planar graphs is of independent interest. It has already been used to obtain fast exact distance oracles for planar graphs [Cohen-Addad et al., FOCS'17].
We describe a new data structure for dynamic nearest neighbor queries in the plane with respect to a general family of distance functions that includes $L_p$-norms and additively weighted Euclidean distances, and for general (convex, pairwise disjoint) sites that have constant description complexity (line segments, disks, etc.). Our data structure has a polylogarithmic update and query time, improving an earlier data structure of Agarwal, Efrat and Sharir that required $O(n^\varepsilon)$ time for an update and $O(\log n)$ time for a query [1]. Our data structure has numerous applications, and in all of them it gives faster algorithms, typically reducing an $O(n^\varepsilon)$ factor in the bounds to polylogarithmic. To further demonstrate its effectiveness, we give here two new applications: an efficient construction of a spanner in a disk intersection graph, and a data structure for efficient connectivity queries in a dynamic disk graph. To obtain this data structure, we combine and extend various techniques and obtain several side results that are of independent interest. Our data structure depends on the existence and an efficient construction of "vertical" shallow cuttings in arrangements of bivariate algebraic functions. We prove that an appropriate level in an arrangement of a random sample of a suitable size provides such a cutting. To compute it efficiently, we develop a randomized incremental construction algorithm for finding the lowest k levels in an arrangement of bivariate algebraic functions (we mostly consider here collections of functions whose lower envelope has linear complexity, as is the case in the dynamic nearest-neighbor context). To analyze this algorithm, we improve a longstanding bound on the combinatorial complexity of the vertical decomposition of these levels. Finally, to obtain our structure, we plug our vertical shallow cutting construction into Chan's algorithm for efficiently maintaining the lower envelope of a dynamic set of planes in $\mathbb{R}^3$. While doing this, we also revisit Chan's technique and present a variant that uses a single binary counter, with a simpler analysis and an improved amortized deletion time.
We study the sample complexity of learning threshold functions under the constraint of differential privacy. It is assumed that each labeled example in the training data is the information of one individual and we would like to come up with a generalizing hypothesis $h$ while guaranteeing differential privacy for the individuals. Intuitively, this means that any single labeled example in the training data should not have a significant effect on the choice of the hypothesis. This problem has received much attention recently; unlike the non-private case, where the sample complexity is independent of the domain size and just depends on the desired accuracy and confidence, for private learning the sample complexity must depend on the domain size $|X|$ (even for approximate differential privacy). Alon et al. (STOC 2019) showed a lower bound of $\Omega(\log^*|X|)$ on the sample complexity and Bun et al. (FOCS 2015) presented an approximate-private learner with sample complexity $\tilde{O}\left(2^{\log^*|X|}\right)$. In this work we reduce this gap significantly, almost settling the sample complexity. We first present a new upper bound (algorithm) of $\tilde{O}\left(\left(\log^*|X|\right)^2\right)$ on the sample complexity and then present an improved version with sample complexity $\tilde{O}\left(\left(\log^*|X|\right)^{1.5}\right)$. Our algorithm is constructed for the related interior point problem, where the goal is to find a point between the largest and smallest input elements. It is based on selecting an input-dependent hash function and using it to embed the database into a domain whose size is reduced logarithmically; this results in a new database, an interior point of which can be used to generate an interior point in the original database in a differentially private manner.
We describe a new data structure for dynamic nearest neighbor queries in the plane with respect to a general family of distance functions. These include $L_p$-norms and additively weighted Euclidean distances. Our data structure supports general (convex, pairwise disjoint) sites that have constant description complexity (e.g., points, line segments, disks, etc.). Our structure uses $O(n \log^3 n)$ storage, and requires polylogarithmic update and query time, improving an earlier data structure of Agarwal, Efrat and Sharir that required $O(n^\varepsilon)$ time for an update and $O(\log n)$ time for a query [SICOMP, 1999]. Our data structure has numerous applications. In all of them, it gives faster algorithms, typically reducing an $O(n^\varepsilon)$ factor in the previous bounds to polylogarithmic. In addition, we give here two new applications: an efficient construction of a spanner in a disk intersection graph, and a data structure for efficient connectivity queries in a dynamic disk graph.
A streaming algorithm is said to be adversarially robust if its accuracy guarantees are maintained even when the data stream is chosen maliciously, by an adaptive adversary. We establish a connection between adversarial robustness of streaming algorithms and the notion of differential privacy. This connection allows us to design new adversarially robust streaming algorithms that outperform the current state-of-the-art constructions for many interesting regimes of parameters.
The average distance from a node to all other nodes in a graph, or from a query point in a metric space to a set of points, is a fundamental quantity in data analysis. The inverse of the average distance, known as the (classic) closeness centrality of a node, is a popular importance measure in the study of social networks. We develop novel structural insights on the sparsifiability of the distance relation via weighted sampling. Based on that, we present highly practical algorithms with strong statistical guarantees for fundamental problems. We show that the average distance (and hence the centrality) for all nodes in a graph can be estimated using $O(ε^{-2})$ single-source distance computations. For a set $V$ of $n$ points in a metric space, we show that after preprocessing which uses $O(n)$ distance computations we can compute a weighted sample $S\subset V$ of size $O(ε^{-2})$ such that the average distance from any query point $v$ to $V$ can be estimated from the distances from $v$ to $S$. Finally, we show that for a set of points $V$ in a metric space, we can estimate the average pairwise distance using $O(n+ε^{-2})$ distance computations. The estimate is based on a weighted sample of $O(ε^{-2})$ pairs of points, which is computed using $O(n)$ distance computations. Our estimates are unbiased, with normalized root mean square error (NRMSE) of at most $ε$. Increasing the sample size by an $O(\log n)$ factor ensures that the probability that the relative error exceeds $ε$ is polynomially small.
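As a generic illustration of estimating an average distance from a weighted sample (only the estimator shape; the paper's contribution is the sampling scheme that makes O(ε^{-2}) points suffice), here is a small Python sketch that assumes the sample is given as (point, inclusion probability) pairs:

```python
from math import dist

def estimate_avg_distance(query, sample, n):
    # Inverse-probability (Horvitz-Thompson) estimate of the average distance
    # from `query` to an n-point set, given a sample of (point, prob) pairs
    # where `prob` is each sampled point's inclusion probability.
    return sum(dist(query, s) / p for s, p in sample) / n

# Ground set (hypothetical): the 10 points (0,0), (1,0), ..., (9,0).
# A uniform sample of 4 of them, each included with probability 0.4.
sample = [((1, 0), 0.4), ((4, 0), 0.4), ((6, 0), 0.4), ((9, 0), 0.4)]
# Unbiased estimate of the true average distance 4.5 (this draw gives 5.0).
print(estimate_avg_distance((0, 0), sample, n=10))
```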
We describe a way of assigning labels to the vertices of any undirected graph on up to $n$ vertices, each composed of $n/2+O(1)$ bits, such that given the labels of two vertices, and no other information regarding the graph, it is possible to decide whether or not the vertices are adjacent in the graph. This is optimal, up to an additive constant, and constitutes the first improvement in almost 50 years of an $n/2+O(\log n)$ bound of Moon. As a consequence, we obtain an induced-universal graph for $n$-vertex graphs containing only $O(2^{n/2})$ vertices, which is optimal up to a multiplicative constant, solving an open problem of Vizing from 1968. We obtain similar tight results for directed graphs, tournaments, and bipartite graphs.
We use soft heaps to obtain simpler optimal algorithms for selecting the $k$-th smallest item, and the set of~$k$ smallest items, from a heap-ordered tree, from a collection of sorted lists, and from $X+Y$, where $X$ and $Y$ are two unsorted sets. Our results match, and in some ways extend and improve, classical results of Frederickson (1993) and Frederickson and Johnson (1982). In particular, for selecting the $k$-th smallest item, or the set of~$k$ smallest items, from a collection of~$m$ sorted lists we obtain a new optimal "output-sensitive" algorithm that performs only $O(m+\sum_{i=1}^m \log(k_i+1))$ comparisons, where $k_i$ is the number of items of the $i$-th list that belong to the overall set of~$k$ smallest items.
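For contrast, a short Python sketch of the classical frontier-heap selection of the k smallest items from m sorted lists, which uses roughly O(m + k log m) comparisons; the paper's soft-heap algorithm improves this to the output-sensitive bound stated above.

```python
import heapq

def k_smallest_from_sorted_lists(lists, k):
    """Classical frontier-heap selection: keep one candidate per list in a
    binary heap and repeatedly extract the minimum, advancing within its list.
    Roughly O(m + k log m) comparisons; not the soft-heap algorithm."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    out = []
    while heap and len(out) < k:
        val, i, j = heapq.heappop(heap)
        out.append(val)
        if j + 1 < len(lists[i]):
            heapq.heappush(heap, (lists[i][j + 1], i, j + 1))
    return out

print(k_smallest_from_sorted_lists([[1, 4, 9], [2, 3, 10], [5, 6]], 4))  # [1, 2, 3, 4]
```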
Hub Labeling (HL) is a data structure for distance oracles. Hierarchical HL (HHL) is a special type of HL that has received a lot of attention from a practical point of view. However, theoretical questions such as NP-hardness and approximation guarantees for HHL algorithms have been left aside. In this paper we study HL and HHL from the complexity-theory point of view. We prove that both HL and HHL are NP-hard, and present upper and lower bounds for the approximation ratios of greedy HHL algorithms used in practice. We also introduce a new variant of the greedy HHL algorithm and a proof that it produces small labels for graphs with small highway dimension.
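For background, a minimal Python sketch of the standard hub-labeling distance query that such labelings support (the toy labels below are hypothetical; a valid labeling guarantees that some hub on a shortest path appears in both labels):

```python
def hl_distance(label_u, label_v):
    """Standard hub-labeling query: each label maps hubs to distances, and the
    distance between u and v is the minimum of d(u, h) + d(h, v) over hubs h
    appearing in both labels."""
    common = label_u.keys() & label_v.keys()
    return min(label_u[h] + label_v[h] for h in common) if common else float("inf")

# Toy labels for the path a - b - c with unit edges; every label contains hub 'b'.
labels = {"a": {"a": 0, "b": 1}, "b": {"b": 0}, "c": {"b": 1, "c": 0}}
print(hl_distance(labels["a"], labels["c"]))  # 2
```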
Stream sampling for variance-optimal estimation of subset sums. Edith Cohen, Nick Duffield, Haim Kaplan, Carsten Lund, and Mikkel Thorup. In Proceedings of the 2009 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 1255–1264. https://doi.org/10.1137/1.9781611973068.136
From a high volume stream of weighted items, we want to maintain a generic sample of a certain limited size k that we can later use to estimate the total weight of arbitrary subsets. This is the classic context of on-line reservoir sampling, thinking of the generic sample as a reservoir. We present an efficient reservoir sampling scheme, VarOpt_k, that dominates all previous schemes in terms of estimation quality. VarOpt_k provides variance-optimal unbiased estimation of subset sums. More precisely, if we have seen n items of the stream, then for any subset size m, our scheme based on k samples minimizes the average variance over all subsets of size m. In fact, the optimality is against any off-line scheme with k samples tailored for the concrete set of items seen. In addition to optimal average variance, our scheme provides tighter worst-case bounds on the variance of particular subsets than previously possible. It is efficient, handling each new item of the stream in O(log k) time, which is optimal even on the word RAM. Finally, it is particularly well suited for combination of samples from different streams in a distributed setting.
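For background, a minimal Python sketch of classic unweighted reservoir sampling (Vitter's Algorithm R), which fixes the streaming setting; VarOpt_k, the scheme of the paper, instead maintains a weighted sample whose subset-sum estimators are variance optimal.

```python
import random

def reservoir_sample(stream, k):
    """Classic unweighted reservoir sampling (Algorithm R): maintains a uniform
    sample of k items from a stream of unknown length, one pass, O(1) extra
    work per item. Not the paper's weighted VarOpt_k scheme."""
    reservoir = []
    for n, item in enumerate(stream):
        if n < k:
            reservoir.append(item)
        else:
            j = random.randint(0, n)  # keep the new item with probability k/(n+1)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1000), 5))
```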
Let $P$ be a set of $n$ points in an axis-parallel rectangle $B$ in the plane. We present an $O(n\alpha(n)\log^4 n)$-time algorithm to preprocess $P$ into a data structure of size $O(n\alpha(n)\log^3 n)$, such that, given a query point $q$, we can find, in $O(\log^4 n)$ time, the largest-area axis-parallel rectangle that is contained in $B$, contains $q$, and its interior contains no point of $P$. This is a significant improvement over the previous solution of Augustine et al., which uses slightly superquadratic preprocessing and storage.
We use soft heaps to obtain simpler optimal algorithms for selecting the $k$-th smallest item, and the set of~$k$ smallest items, from a heap-ordered tree, from a collection of sorted … We use soft heaps to obtain simpler optimal algorithms for selecting the $k$-th smallest item, and the set of~$k$ smallest items, from a heap-ordered tree, from a collection of sorted lists, and from $X+Y$, where $X$ and $Y$ are two unsorted sets. Our results match, and in some ways extend and improve, classical results of Frederickson (1993) and Frederickson and Johnson (1982). In particular, for selecting the $k$-th smallest item, or the set of~$k$ smallest items, from a collection of~$m$ sorted lists we obtain a new optimal output-sensitive algorithm that performs only $O(m+\sum_{i=1}^m \log(k_i+1))$ comparisons, where $k_i$ is the number of items of the $i$-th list that belong to the overall set of~$k$ smallest items.
Suppose we are given a set D of n pairwise intersecting disks in the plane. A planar point set P stabs D if and only if each disk in D contains at least one point from P. We present a deterministic algorithm that takes O(n) time to find five points that stab D. Furthermore, we give a simple example of 13 pairwise intersecting disks that cannot be stabbed by three points. Moreover, we present a simple argument showing that eight disks can be stabbed by at most three points. This provides a simple – albeit slightly weaker – algorithmic version of a classical result by Danzer that such a set D can always be stabbed by four points.
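A tiny Python sketch of the stabbing condition from the abstract (checking a candidate point set against a family of disks), purely to fix the definition; it is not the paper's O(n)-time algorithm for finding the five stabbing points.

```python
from math import dist

def stabs(points, disks):
    # A point set stabs a family of disks, given as (center, radius) pairs,
    # if every disk contains at least one of the points.
    return all(any(dist(p, c) <= r for p in points) for c, r in disks)

disks = [((0, 0), 1.5), ((1, 0), 1.5), ((0, 1), 1.5)]  # pairwise intersecting
print(stabs([(0.3, 0.3)], disks))  # True: a single point stabs all three disks
```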
The Fréchet distance is a well-studied similarity measure between curves. The discrete Fréchet distance is an analogous similarity measure, defined for two sequences of m and n points, where the points are usually sampled from input curves. We consider a variant, called the discrete Fréchet distance with shortcuts, which captures the similarity between (sampled) curves in the presence of outliers. When shortcuts are allowed only in one noise-containing curve, we give a randomized algorithm that runs in O((m + n)^{6/5 + ε}) expected time, for any ε > 0. When shortcuts are allowed in both curves, we give an O((m^{2/3}n^{2/3} + m + n) log^3(m + n))-time deterministic algorithm. We also consider the semicontinuous Fréchet distance with one-sided shortcuts, where we have a sequence of m points and a polygonal curve of n edges, and shortcuts are allowed only in the sequence. We show that this problem can be solved in randomized expected time O((m + n)^{2/3} m^{2/3} n^{1/3} log(m + n)). Our techniques are novel and may find further applications. One of the main new technical results is: Given two sets of points A and B in the plane and an interval I, we develop an algorithm that decides whether the number of pairs (x, y) ∈ A × B whose distance dist(x, y) is in I is less than some given threshold L. The running time of this algorithm decreases as L increases. In case there are more than L pairs of points whose distance is in I, we can get a small sample of pairs that contains a pair at approximately the median distance (i.e., we can approximately "bisect" I). We combine this procedure with additional ideas to search, with a small overhead, for the optimal one-sided Fréchet distance with shortcuts, using a very fast decision procedure. We also show how to apply this technique to approximating distance selection (with respect to rank), and a somewhat more involved variant of this technique is used in the solution of the semicontinuous Fréchet distance with one-sided shortcuts. In general, the new technique can be applied to optimization problems for which the decision procedure is very fast but standard techniques like parametric search make the optimization algorithm substantially slower.
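To fix the distance-counting decision problem described above, here is a brute-force O(|A||B|) Python sketch; the paper's procedure answers the same question much faster, with a running time that decreases as the threshold L grows.

```python
from math import dist

def fewer_than_L_pairs_in_interval(A, B, I, L):
    """Brute-force version of the decision problem: are there fewer than L
    pairs (x, y) in A x B whose distance lies in the interval I = (lo, hi)?"""
    lo, hi = I
    count = sum(1 for x in A for y in B if lo <= dist(x, y) <= hi)
    return count < L

A = [(0, 0), (3, 0)]
B = [(0, 4), (1, 0)]
print(fewer_than_L_pairs_in_interval(A, B, (0, 1.5), 2))  # True: only one pair at distance <= 1.5
```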
We consider the applications of the Frank-Wolfe (FW) algorithm for Apprenticeship Learning (AL). In this setting, we are given a Markov Decision Process (MDP) without an explicit reward function. Instead, we observe an expert that acts according to some policy, and the goal is to find a policy whose feature expectations are closest to those of the expert policy. We formulate this problem as finding the projection of the feature expectations of the expert on the feature expectations polytope – the convex hull of the feature expectations of all the deterministic policies in the MDP. We show that this formulation is equivalent to the AL objective and that solving this problem using the FW algorithm is equivalent to the well-known Projection method of Abbeel and Ng (2004). This insight allows us to analyze AL with tools from the convex optimization literature and to derive tighter convergence bounds on AL. Specifically, we show that a variation of the FW method that is based on taking "away steps" achieves a linear rate of convergence when applied to AL, and that a stochastic version of the FW algorithm can be used to avoid precise estimation of feature expectations. We also experimentally show that this version outperforms the FW baseline. To the best of our knowledge, this is the first work that shows linear convergence rates for AL.
We present an explicit and efficient construction of additively weighted Voronoi diagrams on planar graphs. Let $G$ be a planar graph with $n$ vertices and $b$ sites that lie on a constant number of faces. We show how to preprocess $G$ in $\tilde O(nb^2)$ time so that one can compute any additively weighted Voronoi diagram for these sites in $\tilde O(b)$ time. We use this construction to compute the diameter of a directed planar graph with real arc lengths in $\tilde{O}(n^{5/3})$ time. This improves the recent breakthrough result of Cabello [SODA 2017, SIAM, Philadelphia, 2017, pp. 2143--2152], both by improving the running time (from $\tilde{O}(n^{11/6})$), and by providing a deterministic algorithm. It is in fact the first truly subquadratic deterministic algorithm for this problem. Our use of Voronoi diagrams to compute the diameter follows that of Cabello, but he used abstract Voronoi diagrams, which makes his diameter algorithm more involved, more expensive, and randomized. As in Cabello's work, our algorithm can compute, for every vertex $v$, both the farthest vertex from $v$ (i.e., the eccentricity of $v$), and the sum of distances from $v$ to all other vertices. Hence, our algorithm can also compute the radius, median, and Wiener index (sum of all pairwise distances) of a planar graph within the same time bounds. Our construction of Voronoi diagrams for planar graphs is of independent interest.
Online Weighted Matching with a Sample. Haim Kaplan, David Naori, and Danny Raz. In Proceedings of the 2022 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 1247–1272. https://doi.org/10.1137/1.9781611977073.52
We study the greedy-based online algorithm for edge-weighted matching with (one-sided) vertex arrivals in bipartite graphs, and edge arrivals in general graphs. This algorithm was first studied more than a decade ago by Korula and Pál for the bipartite case in the random-order model. While the weighted bipartite matching problem is solved in the random-order model, this is not the case in recent and exciting online models in which the online player is provided with a sample, and the arrival order is adversarial. The greedy-based algorithm is arguably the most natural and practical algorithm to be applied in these models. Despite its simplicity and appeal, and despite being studied in multiple works, the greedy-based algorithm was not fully understood in any of the studied online models, and its actual performance remained an open question for more than a decade. We provide a thorough analysis of the greedy-based algorithm in several online models. For vertex arrivals in bipartite graphs, we characterize the exact competitive ratio of this algorithm in the random-order model, for any arrival order of the vertices subsequent to the sampling phase (adversarial and random orders in particular). We use it to derive a tight analysis in the recent adversarial-order model with a sample (AOS model) for any sample size, providing the first result in this model beyond the simple secretary problem. Then, we generalize and strengthen the black-box method of converting results in the random-order model to single-sample prophet inequalities, and use it to derive the state-of-the-art single-sample prophet inequality for the problem. Finally, we use our new techniques to analyze the greedy-based algorithm for edge arrivals in general graphs and derive results in all the mentioned online models. In this case as well, we improve upon the state-of-the-art single-sample prophet inequality.
We study mechanisms for an allocation of goods among agents, where agents have no incentive to lie about their true values (incentive compatible) and for which no agent will seek to exchange outcomes with another (envy-free). Mechanisms satisfying each requirement separately have been studied extensively, but there are few results on mechanisms achieving both. We are interested in those allocations for which there exist payments such that the resulting mechanism is simultaneously incentive compatible and envy-free. Cyclic monotonicity is a characterization of incentive-compatible allocations, and local efficiency is a characterization of envy-free allocations. We combine the two to give a characterization of allocations which are both incentive compatible and envy-free. We show that even for allocations that allow payments leading to incentive-compatible mechanisms, and other payments leading to envy-free mechanisms, there may not exist any payments for which the mechanism is simultaneously incentive compatible and envy-free. The characterization that we give lets us compute the set of Pareto-optimal mechanisms that trade off envy freeness for incentive compatibility.
We study envy-free mechanisms for scheduling tasks on unrelated machines (agents) that approximately minimize the makespan. For indivisible tasks, we put forward an envy-free poly-time mechanism that approximates the minimal makespan to within a factor of O(log m), where m is the number of machines. We also show a lower bound of Ω(log m / log log m). This improves the recent result of Mu'alem [22], who gives an upper bound of (m+1)/2 and a lower bound of 2-1/m. For divisible tasks, we show that there always exists an envy-free poly-time mechanism with optimal makespan. Finally, we demonstrate how our mechanism for envy-free makespan minimization can be interpreted as a market clearing problem.
We derive and analyze learning algorithms for apprenticeship learning, policy evaluation, and policy gradient for average reward criteria. Existing algorithms explicitly require an upper bound on the mixing time. In contrast, we build on ideas from Markov chain theory and derive sampling algorithms that do not require such an upper bound. For these algorithms, we provide theoretical bounds on their sample-complexity and running time.
Let C = {C_1, ..., C_n} be a set of n pairwise-disjoint convex s-gons, for some constant s, and let π be a probability density function (pdf) over the non-negative reals. For each i, let K_i be the Minkowski sum of C_i with a disk of radius r_i, where each r_i is a random non-negative number drawn independently from the distribution determined by π. We show that the expected complexity of the union of K_1, ..., K_n is O(n log n), for any pdf π; the constant of proportionality depends on s, but not on the pdf. Next, we consider the following problem that arises in analyzing the vulnerability of a network under a physical attack. Let G = (V, E) be a planar geometric graph where E is a set of n line segments with pairwise-disjoint relative interiors. Let f: R -> [0,1] be an edge failure probability function, where a physical attack at a location x causes an edge e of E at distance r from x to fail with probability f(r); we assume that f is of the form f(x) = 1 - P(x), where P is a cumulative distribution function on the non-negative reals. The goal is to compute the most vulnerable location for G, i.e., the location of the attack that maximizes the expected number of failing edges of G. Using our bound on the complexity of the union of random Minkowski sums, we present a near-linear Monte-Carlo algorithm for computing a location that is an approximately most vulnerable location of attack for G.
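To illustrate the objective being maximized, a short Python sketch that evaluates, for one candidate attack location x, the expected number of failing edges, which by linearity of expectation is the sum of f(dist(x, e)) over edges e; the failure profile f below is a hypothetical example, and finding the maximizing x is the hard part addressed in the paper.

```python
from math import hypot

def point_segment_dist(x, a, b):
    # Euclidean distance from point x to the segment ab.
    (px, py), (ax, ay), (bx, by) = x, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return hypot(px - ax, py - ay)
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)))
    return hypot(px - (ax + t * dx), py - (ay + t * dy))

def expected_failures(x, edges, f):
    # Expected number of failing edges for an attack at x: sum of f(dist(x, e)).
    return sum(f(point_segment_dist(x, a, b)) for a, b in edges)

edges = [((0, 0), (1, 0)), ((0, 1), (1, 1))]
f = lambda r: max(0.0, 1.0 - r)                  # hypothetical failure-probability profile
print(expected_failures((0.5, 0.5), edges, f))   # 1.0: each edge fails with probability 0.5
```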
The average distance from a node to all other nodes in a graph, or from a query point in a metric space to a set of points, is a fundamental quantity in data analysis. The inverse of the average distance, known as the (classic) closeness centrality of a node, is a popular importance measure in the study of social networks. We develop novel structural insights on the sparsifiability of the distance relation via weighted sampling. Based on that, we present highly practical algorithms with strong statistical guarantees for fundamental problems. We show that the average distance (and hence the centrality) for all nodes in a graph can be estimated using O(epsilon^{-2}) single-source distance computations. For a set V of n points in a metric space, we show that after preprocessing which uses O(n) distance computations we can compute a weighted sample S subset of V of size O(epsilon^{-2}) such that the average distance from any query point v to V can be estimated from the distances from v to S. Finally, we show that for a set of points V in a metric space, we can estimate the average pairwise distance using O(n+epsilon^{-2}) distance computations. The estimate is based on a weighted sample of O(epsilon^{-2}) pairs of points, which is computed using O(n) distance computations. Our estimates are unbiased, with normalized root mean square error (NRMSE) of at most epsilon. Increasing the sample size by an O(log(n)) factor ensures that the probability that the relative error exceeds epsilon is polynomially small.
We review the theory of, and develop algorithms for, transforming a finite point set in ${\bf R}^d$ into a set in \emph{radial isotropic position} by a nonsingular linear transformation followed by rescaling each image point to the unit sphere. This problem arises in a wide spectrum of applications in computer science and mathematics. Our algorithms use gradient descent methods for a particular convex function $f$ whose minimum defines the transformation, and our main focus is on analyzing their performance. Although the minimum can be computed exactly, by expensive symbolic algebra techniques, gradient descent only approximates the desired minimum to any desired level of accuracy. We show that computing the gradient of $f$ amounts to computing the Singular Value Decomposition (SVD) of a certain matrix associated with the input set, making it simple to implement. We believe it to be superior to other approximate techniques (mainly the ellipsoid algorithm) used previously to find this transformation, and it should run much faster in practice. We prove that $f$ is smooth, which yields a convergence rate proportional to $1/ε$, where $ε$ is the desired approximation accuracy. To complete the analysis, we provide upper bounds on the norm of the optimal solution which depend on new parameters measuring "the degeneracy" in our input. We believe that our parameters capture degeneracy better than other, seemingly weaker, parameters used in previous works. We next analyze the strong convexity of $f$, and present two worst-case lower bounds on the smallest eigenvalue of its Hessian. This gives another worst-case bound on the convergence rate of another variant of gradient descent that depends only logarithmically on $1/ε$.
Let $P \subset \mathbb{R}^2$ be a planar $n$-point set such that each point $p \in P$ has an associated radius $r_p > 0$. The transmission graph $G$ for $P$ is the directed graph with vertex set $P$ such that for any $p, q \in P$, there is an edge from $p$ to $q$ if and only if $d(p, q) \leq r_p$. Let $t > 1$ be a constant. A $t$-spanner for $G$ is a subgraph $H \subseteq G$ with vertex set $P$ so that for any two vertices $p,q \in P$, we have $d_H(p, q) \leq t d_G(p, q)$, where $d_H$ and $d_G$ denote the shortest path distance in $H$ and $G$, respectively (with Euclidean edge lengths). We show how to compute a $t$-spanner for $G$ with $O(n)$ edges in $O(n (\log n + \log \Psi))$ time, where $\Psi$ is the ratio of the largest and smallest radius of a point in $P$. Using more advanced data structures, we obtain a construction that runs in $O(n \log^5 n)$ time, independent of $\Psi$. We give two applications for our spanners. First, we show how to use our spanner to find a BFS tree in $G$ from any given start vertex in $O(n \log n)$ time (in addition to the time it takes to build the spanner). Second, we show how to use our spanner to extend a reachability oracle to answer geometric reachability queries. In a geometric reachability query we ask whether a vertex $p$ in $G$ can "reach" a target $q$ which is an arbitrary point in the plane (rather than restricted to be another vertex $q$ of $G$ in a standard reachability query). Our spanner allows the reachability oracle to answer geometric reachability queries with an additive overhead of $O(\log n\log \Psi)$ to the query time and $O(n \log \Psi)$ to the space.
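For illustration, a naive O(n²) Python sketch that materializes the transmission graph exactly as defined above (an edge p → q whenever d(p, q) ≤ r_p); the point of the paper's spanner is precisely to avoid building all, possibly quadratically many, such edges.

```python
from math import dist

def transmission_graph(points, radii):
    """Naive O(n^2) construction of the transmission graph: a directed edge
    i -> j whenever d(points[i], points[j]) <= radii[i]."""
    edges = []
    for i, p in enumerate(points):
        for j, q in enumerate(points):
            if i != j and dist(p, q) <= radii[i]:
                edges.append((i, j))
    return edges

pts = [(0, 0), (2, 0), (5, 0)]
print(transmission_graph(pts, [3, 1, 10]))  # [(0, 1), (2, 0), (2, 1)]
```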
The Fibonacci heap is a classic data structure that supports deletions in logarithmic amortized time and all other heap operations in O(1) amortized time. We explore the design space of this data structure. We propose a version with the following improvements over the original: (i) Each heap is represented by a single heap-ordered tree, instead of a set of trees. (ii) Each decrease-key operation does only one cut and a cascade of rank changes, instead of doing a cascade of cuts. (iii) The outcomes of all comparisons done by the algorithm are explicitly represented in the data structure, so none are wasted. We also give an example to show that without cascading cuts or rank changes, both the original data structure and the new version fail to have the desired efficiency, solving an open problem of Fredman. Finally, we illustrate the richness of the design space by proposing several alternative ways to do cascading rank changes, including a randomized strategy related to one previously proposed by Karger. We leave the analysis of these alternatives as intriguing open problems.
We present a randomized $\tilde{O}(n^{3.5})$-time algorithm for computing \emph{optimal energetic paths} for an electric car between all pairs of vertices in an $n$-vertex directed graph with positive and negative \emph{costs}. The optimal energetic paths are finite and well-defined even if the graph contains negative-cost cycles. This makes the problem much more challenging than standard shortest paths problems. More specifically, for every two vertices $s$ and~$t$ in the graph, the algorithm computes $\alpha_B(s,t)$, the maximum amount of charge the car can reach~$t$ with, if it starts at~$s$ with full battery, i.e., with charge~$B$, where~$B$ is the capacity of the battery. In the presence of negative-cost cycles, optimal paths are not necessarily simple. For dense graphs, our new $\tilde{O}(n^{3.5})$ time algorithm improves on a previous $\tilde{O}(mn^{2})$-time algorithm of Dorfman et al. [ESA 2023] for the problem. The \emph{cost} of an arc is the amount of charge taken from the battery of the car when traversing the arc. The charge in the battery can never exceed the capacity~$B$ of the battery and can never be negative. An arc of negative cost may correspond, for example, to a downhill road segment, while an arc with a positive cost may correspond to an uphill segment. A negative-cost cycle, if one exists, can be used in certain cases to charge the battery to its capacity. This makes the problem more interesting and more challenging. Negative-cost cycles may arise when certain road segments have magnetic charging strips, or when the electric car has solar panels. Combined with a result of Dorfman et al. [SOSA 2024], this also provides a randomized $\tilde{O}(n^{3.5})$-time algorithm for computing \emph{minimum-cost paths} between all pairs of vertices in an $n$-vertex graph when the battery can be externally recharged, at varying costs, at intermediate vertices.
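To make the battery constraints concrete, a small Python sketch that simulates the charge along a single fixed path, clamping at the capacity B and declaring the path infeasible if the charge would drop below zero; the paper computes the optimal value α_B(s, t) over all paths, which this sketch does not attempt.

```python
def final_charge(costs, B, initial=None):
    """Simulate the battery along a fixed path. `costs` are the arc costs
    (charge drawn from the battery; negative for downhill arcs), B is the
    battery capacity. The charge is clamped at B and the path is infeasible
    (returns None) if the charge would drop below zero."""
    charge = B if initial is None else initial
    for c in costs:
        charge = min(B, charge - c)
        if charge < 0:
            return None
    return charge

print(final_charge([3, -5, 4], B=10))  # 6: charge goes 10 -> 7 -> 10 (clamped) -> 6
```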
We consider the problem of computing optimal search trees on trees (STTs). STTs generalize binary search trees (BSTs), which support search over a path (a linear order), to search trees that facilitate search over general tree topologies. Golinsky proposed a linear programming (LP) relaxation of the problem of computing an optimal static STT over a given tree topology. He used this LP formulation to compute an STT that is a $2$-approximation to an optimal STT, and conjectured that it is, in fact, an extended formulation of the convex hull of all depth-vectors of STTs, and thus always gives an optimal solution. In this work we study this LP approach further. We show that the conjecture is false and that Golinsky's LP does not always give an optimal solution. To show this we use what we call the "normals method". We use this method to enumerate over vertices of Golinsky's polytope for all tree topologies of no more than 8 nodes. We give a lower bound on the integrality gap of the LP and on the approximation ratio of Golinsky's rounding method. We further outline several research directions that can lead to the resolution of the question whether one can compute an optimal STT in polynomial time.
We introduce efficient differentially private (DP) algorithms for several linear algebraic tasks, including solving linear equalities over arbitrary fields, linear inequalities over the reals, and computing affine spans and convex hulls. As an application, we obtain efficient DP algorithms for learning halfspaces and affine subspaces. Our algorithms addressing equalities are strongly polynomial, whereas those addressing inequalities are weakly polynomial. Furthermore, this distinction is inevitable: no DP algorithm for linear programming can be strongly polynomial-time efficient.
We revisit the fundamental question of formally defining what constitutes a reconstruction attack. While often clear from the context, our exploration reveals that a precise definition is much more nuanced than it appears, to the extent that a single all-encompassing definition may not exist. Thus, we employ a different strategy and aim to "sandwich" the concept of reconstruction attacks by addressing two complementing questions: (i) What conditions guarantee that a given system is protected against such attacks? (ii) Under what circumstances does a given attack clearly indicate that a system is not protected? More specifically:
* We introduce a new definitional paradigm -- Narcissus Resiliency -- to formulate a security definition for protection against reconstruction attacks. This paradigm has a self-referential nature that enables it to circumvent shortcomings of previously studied notions of security. Furthermore, as a side-effect, we demonstrate that Narcissus resiliency captures as special cases multiple well-studied concepts including differential privacy and other security notions of one-way functions and encryption schemes.
* We formulate a link between reconstruction attacks and Kolmogorov complexity. This allows us to put forward a criterion for evaluating when such attacks are convincingly successful.
An electric car equipped with a battery of a finite capacity travels on a road network with an infrastructure of charging stations. Each charging station has a possibly different cost per unit of energy. Traversing a given road segment requires a specified amount of energy that may be positive, zero or negative. The car can only traverse a road segment if it has enough charge to do so (the charge cannot drop below zero), and it cannot charge its battery beyond its capacity. To travel from one point to another the car needs to choose a travel plan consisting of a path in the network and a recharging schedule that specifies how much energy to charge at each charging station on the path, making sure of having enough energy to reach the next charging station or the destination. The cost of the plan is the total charging cost along the chosen path. We reduce the problem of computing plans between every two junctions of the network to two problems: finding optimal energetic paths when no charging is allowed, and finding standard shortest paths. When there are no negative cycles in the network, we obtain an $O(n^3)$-time algorithm for computing all-pairs travel plans, where $n$ is the number of junctions in the network. We obtain slightly faster algorithms under some further assumptions. We also consider the case in which a bound is placed on the number of rechargings allowed.
Recent advances in algorithmic design show how to utilize predictions obtained by machine learning models from past and present data. These approaches have demonstrated an enhancement in performance when the predictions are accurate, while also ensuring robustness by providing worst-case guarantees when predictions fail. In this paper we focus on online problems; prior research in this context was focused on a paradigm where the predictor is pre-trained on past data and then used as a black box (to get the predictions it was trained for). In contrast, in this work, we unpack the predictor and integrate the learning problem it gives rise to within the algorithmic challenge. In particular we allow the predictor to learn as it receives larger parts of the input, with the ultimate goal of designing online learning algorithms specifically tailored for the algorithmic task at hand. Adopting this perspective, we focus on a number of fundamental problems, including caching and scheduling, which have been well-studied in the black-box setting. For each of the problems we consider, we introduce new algorithms that take advantage of explicit learning algorithms which we carefully design towards optimizing the overall performance. We demonstrate the potential of our approach by deriving performance bounds which improve over those established in previous work.
Let S ⊆ ℝ² be a set of n sites in the plane, so that every site s ∈ S has an associated radius r_s > 0. Let D(S) be the disk intersection graph defined by S, i.e., the graph with vertex set S and an edge between two distinct sites s, t ∈ S if and only if the disks with centers s, t and radii r_s, r_t intersect. Our goal is to design data structures that maintain the connectivity structure of D(S) as S changes dynamically over time.
Binary search trees (BSTs) are one of the most basic and widely used data structures. The best static tree for serving a sequence of queries (searches) can be computed by dynamic programming. In contrast, when the BSTs are allowed to be dynamic (i.e. change by rotations between searches), we still do not know how to compute the optimal algorithm (OPT) for a given sequence. One of the candidate algorithms whose serving cost is suspected to be optimal up to a (multiplicative) constant factor is known by the name Greedy Future (GF). In an equivalent geometric way of representing queries on BSTs, GF is in fact equivalent to another algorithm called Geometric Greedy (GG). Most of the results on GF are obtained using the geometric model and the study of GG. Despite this intensive recent fruitful research, the best lower bound we have on the competitive ratio of GF is $\frac{4}{3}$. Furthermore, it has been conjectured that the additive gap between the cost of GF and OPT is only linear in the number of queries. In this paper we prove a lower bound of $2$ on the competitive ratio of GF, and we prove that the additive gap between the cost of GF and OPT can be $\Omega(m \cdot \log\log n)$ where $n$ is the number of items in the tree and $m$ is the number of queries.
We study the online facility location problem with uniform facility costs in the random-order model. Meyerson's algorithm [FOCS'01] is arguably the most natural and simple online algorithm for the problem with several advantages and appealing properties. Its analysis in the random-order model is one of the cornerstones of random-order analysis beyond the secretary problem. Meyerson's algorithm was shown to be (asymptotically) optimal in the standard worst-case adversarial-order model and 8-competitive in the random order model. While this bound in the random-order model is the long-standing state-of-the-art, it is not known to be tight, and the true competitive-ratio of Meyerson's algorithm remained an open question for more than two decades. We resolve this question and prove tight bounds on the competitive-ratio of Meyerson's algorithm in the random-order model, showing that it is exactly 4-competitive. Following our tight analysis, we introduce a generic parameterized version of Meyerson's algorithm that retains all the advantages of the original version. We show that the best algorithm in this family is exactly 3-competitive. On the other hand, we show that no online algorithm for this problem can achieve a competitive-ratio better than 2. Finally, we prove that the algorithms in this family are robust to partial adversarial arrival orders.
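For reference, a minimal sketch of Meyerson's classical rule analyzed above (not the parameterized family introduced in the paper): each arriving point opens a new facility with probability min(d/f, 1), where d is its distance to the nearest open facility and f is the uniform facility cost; otherwise it is assigned to that nearest facility. The interface (a `dist` callback, a cost `f`) is an illustrative assumption.

```python
import random

def meyerson_ofl(points, f, dist):
    """Sketch of Meyerson's online facility location rule (uniform facility cost f)."""
    facilities = []
    total_cost = 0.0
    for p in points:                       # points arrive online, one by one
        if not facilities:
            facilities.append(p)           # the first point always opens a facility
            total_cost += f
            continue
        d = min(dist(p, q) for q in facilities)
        if random.random() < min(d / f, 1.0):
            facilities.append(p)           # open a new facility at p
            total_cost += f
        else:
            total_cost += d                # assign p to its nearest open facility
    return facilities, total_cost

# example usage on the line with unit facility cost:
# meyerson_ofl([0.1, 0.2, 5.0, 5.1], f=1.0, dist=lambda a, b: abs(a - b))
```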
We introduce the concurrent shuffle model of differential privacy. In this model we have multiple concurrent shufflers permuting messages from different, possibly overlapping, batches of users. Similarly to the standard (single) shuffle model, the privacy requirement is that the concatenation of all shuffled messages should be differentially private. We study the private continual summation problem (a.k.a. the counter problem) and show that the concurrent shuffle model allows for significantly improved error compared to a standard (single) shuffle model. Specifically, we give a summation algorithm with error $\tilde{O}(n^{1/(2k+1)})$ with $k$ concurrent shufflers on a sequence of length $n$. Furthermore, we prove that this bound is tight for any $k$, even if the algorithm can choose the sizes of the batches adaptively. For $k=\log n$ shufflers, the resulting error is polylogarithmic, much better than $\tilde{\Theta}(n^{1/3})$ which we show is the smallest possible with a single shuffler. We use our online summation algorithm to get algorithms with improved regret bounds for the contextual linear bandit problem. In particular we get optimal $\tilde{O}(\sqrt{n})$ regret with $k= \tilde{\Omega}(\log n)$ concurrent shufflers.
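For contrast with the concurrent-shuffle protocol above, here is a minimal sketch of the standard binary-tree counting mechanism in the central (single-curator) model, which privately releases all prefix sums of a 0/1 stream with polylogarithmic error. The function names and the exact noise calibration are illustrative assumptions, not the paper's algorithm.

```python
import math
import random

def laplace(scale):
    # the difference of two independent exponentials is a Laplace(0, scale) sample
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def tree_counter(stream, eps):
    """Release all prefix sums of a 0/1 stream with eps-DP (central model).

    Every prefix [0, t) is covered by at most height = floor(log2 n) + 1
    dyadic blocks; each block gets one memoized noisy sum.  Each element lies
    in at most `height` blocks, so Laplace(height / eps) noise per block keeps
    the whole collection of released sums eps-differentially private.
    """
    n = len(stream)
    height = (math.floor(math.log2(n)) + 1) if n > 0 else 1
    scale = height / eps
    noisy = {}

    def noisy_block(lo, hi):
        if (lo, hi) not in noisy:
            noisy[(lo, hi)] = sum(stream[lo:hi]) + laplace(scale)
        return noisy[(lo, hi)]

    prefixes = []
    for t in range(1, n + 1):
        total, lo, remaining = 0.0, 0, t
        while remaining > 0:                  # peel off the largest dyadic block
            block = 1 << (remaining.bit_length() - 1)
            total += noisy_block(lo, lo + block)
            lo += block
            remaining -= block
        prefixes.append(total)
    return prefixes
```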
In this work we introduce an interactive variant of joint differential privacy towards handling online processes in which existing privacy definitions seem too restrictive. We study basic properties of this definition and demonstrate that it satisfies (suitable variants of) group privacy, composition, and post-processing. We then study the cost of interactive joint privacy in the basic setting of online classification. We show that any (possibly non-private) learning rule can be effectively transformed to a private learning rule with only a polynomial overhead in the mistake bound. This demonstrates a stark difference with more restrictive notions of privacy such as the one studied by Golowich and Livni (2021), where only a double exponential overhead on the mistake bound is known (via an information theoretic upper bound).
A weighted directed graph $G=(V,A,c)$, where $A\subseteq V\times V$ and $c:A\to \mathbb{R}$, describes a road network in which an electric car can roam. An arc $uv$ models a road segment connecting the two vertices $u$ and $v$. The cost $c(uv)$ of an arc $uv$ is the amount of energy the car needs to traverse the arc. This amount may be positive, zero or negative. To make the problem realistic, we assume there are no negative cycles. The car has a battery that can store up to $B$ units of energy. It can traverse an arc $uv\in A$ only if it is at $u$ and the charge $b$ in its battery satisfies $b\ge c(uv)$. If it traverses the arc, it reaches $v$ with a charge of $\min(b-c(uv),B)$. Arcs with positive costs deplete the battery, arcs with negative costs charge the battery, but not above its capacity of $B$. Given $s,t\in V$, can the car travel from $s$ to $t$, starting at $s$ with an initial charge $b$, where $0\le b\le B$? If so, what is the maximum charge with which the car can reach $t$? Equivalently, what is the smallest $\delta_{B,b}(s,t)$ such that the car can reach $t$ with a charge of $b-\delta_{B,b}(s,t)$, and which path should the car follow to achieve this? We refer to $\delta_{B,b}(s,t)$ as the energetic cost of traveling from $s$ to $t$. We let $\delta_{B,b}(s,t)=\infty$ if the car cannot travel from $s$ to $t$ starting with an initial charge of $b$. The problem of computing energetic costs is a strict generalization of the standard shortest paths problem. We show that the single-source minimum energetic paths problem can be solved using simple, but subtle, adaptations of the Bellman-Ford and Dijkstra algorithms. To make Dijkstra's algorithm work in the presence of negative arcs, but no negative cycles, we use a variant of the $A^*$ search heuristic. These results are explicit or implicit in some previous papers. We provide a simpler and unified description of these algorithms.
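The traversal rule described above (traverse $uv$ only when $b \ge c(uv)$, arriving with charge $\min(b-c(uv),B)$) can be turned into a Bellman-Ford-style relaxation that maximizes the charge with which each vertex is reached. The sketch below is a straightforward adaptation under the stated assumption of no negative cycles; it is not the paper's Dijkstra/A*-based variant.

```python
def max_final_charge(n, arcs, s, B, b):
    """Bellman-Ford-style relaxation for the maximum final charge.

    Vertices are 0..n-1; arcs is a list of (u, v, cost), where cost may be
    negative (a regenerative segment) but the graph has no negative cycles.
    charge[v] is the largest battery level with which the car can reach v
    starting at s with charge b and battery capacity B; -inf marks vertices
    that are unreachable.  With no negative cycles, optimal routes are simple,
    so n - 1 rounds of relaxation suffice.
    """
    NEG = float("-inf")
    charge = [NEG] * n
    charge[s] = b
    for _ in range(n - 1):
        updated = False
        for u, v, cost in arcs:
            if charge[u] != NEG and charge[u] >= cost:     # enough charge to traverse
                new = min(charge[u] - cost, B)             # clamp at battery capacity
                if new > charge[v]:
                    charge[v] = new
                    updated = True
        if not updated:
            break
    # the energetic cost delta_{B,b}(s, t) is b - charge[t] whenever charge[t] > -inf
    return charge
```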
Let $S \subseteq \mathbb{R}^2$ be a set of $n$ \emph{sites} in the plane, so that every site $s \in S$ has an \emph{associated radius} $r_s > 0$. Let $D(S)$ be the \emph{disk intersection graph} defined by $S$, i.e., the graph with vertex set $S$ and an edge between two distinct sites $s, t \in S$ if and only if the disks with centers $s$, $t$ and radii $r_s$, $r_t$ intersect. Our goal is to design data structures that maintain the connectivity structure of $D(S)$ as $S$ changes dynamically over time. We consider the incremental case, where new sites can be inserted into $S$. While previous work focuses on data structures whose running time depends on the ratio between the smallest and the largest site in $S$, we present a data structure with $O(\alpha(n))$ amortized query time and $O(\log^6 n)$ expected amortized insertion time.
We study the reverse shortest path problem on disk graphs in the plane. In this problem we consider the proximity graph of a set of $n$ disks in the plane of arbitrary radii: In this graph two disks are connected if the distance between them is at most some threshold parameter $r$. The case of intersection graphs is a special case with $r=0$. We give an algorithm that, given a target length $k$, computes the smallest value of $r$ for which there is a path of length at most $k$ between some given pair of disks in the proximity graph. Our algorithm runs in $O^*(n^{5/4})$ randomized expected time, which improves to $O^*(n^{6/5})$ for unit disk graphs, where all the disks have the same radius. Our technique is robust and can be applied to many variants of the problem. One significant variant is the case of weighted proximity graphs, where edges are assigned real weights equal to the distance between the disks or between their centers, and $k$ is replaced by a target weight $w$; that is, we seek a path whose length is at most $w$. In other variants, we want to optimize a parameter different from $r$, such as a scale factor of the radii of the disks. The main technique for the decision version of the problem (determining whether the graph with a given $r$ has the desired property) is based on efficient implementations of BFS (for the unweighted case) and of Dijkstra's algorithm (for the weighted case), using efficient data structures for maintaining the bichromatic closest pair for certain bicliques and several distance functions. The optimization problem is then solved by combining the resulting decision procedure with enhanced variants of the interval shrinking and bifurcation technique of [4].
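To make the decision-plus-optimization structure concrete, here is a naive quadratic-time sketch: a BFS decision procedure for a given threshold r, combined with a binary search over the O(n^2) candidate values of r. The interface (disks as (x, y, radius) triples) is an assumption; the paper's algorithm replaces the explicit BFS with implicit bichromatic-closest-pair data structures and a more refined optimization technique.

```python
import math
from collections import deque

def hop_distance(disks, r, s, t):
    """BFS distance in the proximity graph: disks i, j are adjacent when the
    distance between them (center distance minus both radii) is at most r."""
    n = len(disks)

    def adjacent(i, j):
        (xi, yi, ri), (xj, yj, rj) = disks[i], disks[j]
        return math.hypot(xi - xj, yi - yj) - ri - rj <= r

    dist = [None] * n
    dist[s] = 0
    q = deque([s])
    while q:
        u = q.popleft()
        for v in range(n):
            if dist[v] is None and adjacent(u, v):
                dist[v] = dist[u] + 1
                q.append(v)
    return dist[t]          # None if t is unreachable

def reverse_shortest_path(disks, s, t, k):
    """Smallest threshold r for which s and t are within k hops, found by a
    binary search over the sorted pairwise candidate values."""
    cand = sorted(
        max(0.0, math.hypot(a[0] - b[0], a[1] - b[1]) - a[2] - b[2])
        for i, a in enumerate(disks) for b in disks[i + 1:]
    )
    lo, hi, best = 0, len(cand) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        d = hop_distance(disks, cand[mid], s, t)
        if d is not None and d <= k:
            best, hi = cand[mid], mid - 1
        else:
            lo = mid + 1
    return best
```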
Motivated by recent advances in Software Defined Networks (SDNs) and the desire to utilize a limited number of configurable optical switches, we define an online problem which we call the Caching in Matchings problem. This problem has a natural combinatorial structure and therefore may find additional applications in theory and practice. In the Caching in Matchings problem our cache consists of $k$ matchings of connections between servers that form a bipartite graph. To cache a connection we insert it into one of the $k$ matchings, possibly evicting at most two other connections from this matching. This problem resembles the problem known as Connection Caching, where we also cache connections but our only restriction is that they form a graph with bounded degree $k$. Our results show a somewhat surprising qualitative separation between the problems: The competitive ratio of any online algorithm for caching in matchings must depend on the size of the graph. Specifically, we give a deterministic $O(nk)$-competitive and a randomized $O(n \log k)$-competitive algorithm for caching in matchings, where $n$ is the number of servers and $k$ is the number of matchings. We also show that the competitive ratio of any deterministic algorithm is $\Omega(\max(\frac{n}{k},k))$ and of any randomized algorithm is $\Omega(\log \frac{n}{k^2 \log k} \cdot \log k)$. In particular, the lower bound for randomized algorithms is $\Omega(\log n)$ regardless of $k$, and can be as high as $\Omega(\log^2 n)$ if $k=n^{1/3}$, for example. We also show that if we allow the algorithm to use at least $2k-1$ matchings compared to the $k$ used by the optimum, then we match the competitive ratios of connection caching, which are independent of $n$. Interestingly, we also show that even a single extra matching for the algorithm allows us to get substantially better bounds.
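A toy model of the cache operations described above (insert a connection into one of the k matchings, evicting at most the two conflicting connections), with hypothetical method names. Choosing which matching to insert into is exactly the algorithmic question the paper studies and is left open here.

```python
class MatchingCache:
    """Toy model of the 'caching in matchings' cache, not a competitive policy."""

    def __init__(self, k):
        # matching i is stored as a symmetric dict: server -> matched partner
        self.matchings = [dict() for _ in range(k)]

    def insert(self, u, v, i):
        """Cache connection (u, v) in matching i; return the evicted connections."""
        m = self.matchings[i]
        if m.get(u) == v:                 # already cached in this matching
            return []
        evicted = []
        for x in (u, v):
            if x in m:                    # at most one conflicting edge per endpoint
                y = m.pop(x)
                m.pop(y, None)
                evicted.append((x, y))
        m[u], m[v] = v, u
        return evicted

    def contains(self, u, v):
        return any(m.get(u) == v for m in self.matchings)
```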
A streaming algorithm is said to be adversarially robust if its accuracy guarantees are maintained even when the data stream is chosen maliciously, by an adaptive adversary. We establish a connection between adversarial robustness of streaming algorithms and the notion of differential privacy. This connection allows us to design new adversarially robust streaming algorithms that outperform the current state-of-the-art constructions for many interesting regimes of parameters.
We present efficient differentially private algorithms for learning unions of polygons in the plane (which are not necessarily convex). Our algorithms are $(\alpha,\beta)$-probably approximately correct and $(\varepsilon,\delta)$-differentially private using a sample of size $\tilde{O}\left(\frac{1}{\alpha\varepsilon}k\log d\right)$, where the domain is $[d]\times[d]$ and $k$ is the number of edges in the union of polygons. Our algorithms are obtained by designing a private variant of the classical (nonprivate) learner for conjunctions using the greedy algorithm for set cover.
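The nonprivate building block mentioned above is the classical greedy set-cover heuristic; a minimal sketch follows. The paper's private variant replaces the exact choice of the best set with a private selection step, which is not shown here.

```python
def greedy_set_cover(universe, sets):
    """Classical greedy set cover: repeatedly pick the set that covers the most
    still-uncovered elements.  `universe` is an iterable of elements and `sets`
    is a list of Python sets; returns the indices of the chosen sets."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(range(len(sets)), key=lambda i: len(uncovered & sets[i]))
        if not uncovered & sets[best]:
            break                          # remaining elements cannot be covered
        chosen.append(best)
        uncovered -= sets[best]
    return chosen
```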
Given an input that undergoes a sequence of updates, a dynamic algorithm maintains a valid solution to some predefined problem at any point in time; the goal is to design an algorithm in which computing a solution to the updated input is done more efficiently than computing the solution from scratch. A dynamic algorithm against an adaptive adversary is required to be correct when the adversary chooses the next update after seeing the previous outputs of the algorithm. We obtain faster dynamic algorithms against an adaptive adversary and separation results between what is achievable in the oblivious vs. adaptive settings. To get these results we exploit techniques from differential privacy, cryptography, and adaptive data analysis. Our results are as follows.
Traffic splitting is a required functionality in networks, for example for load balancing over multiple paths or among different servers. The capacities of the servers determine the partition by which traffic should be split. A recent approach implements traffic splitting within the ternary content addressable memory (TCAM), which is often available in switches. It is important to reduce the amount of memory allocated for this task since TCAMs are power consuming and are often also required for other tasks such as classification and routing. Previous work showed how to compute the smallest prefix-matching TCAM necessary to implement a given partition exactly. In this paper we solve the more practical case, where at most $n$ prefix-matching TCAM rules are available, restricting the ability to implement exactly the desired partition. We give simple and efficient algorithms to find $n$ rules that generate a partition closest in $L_\infty$ to the desired one. We do the same for a one-sided version of $L_\infty$ which equals the maximum overload on a server, and for a relative version of it. We use our algorithms to evaluate how the expected error changes as a function of the number of rules, the number of servers, and the width of the TCAM.
Online Weighted Matching with a Sample. Haim Kaplan, David Naori, and Danny Raz. Proceedings of the 2022 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 1247-1272. https://doi.org/10.1137/1.9781611977073.52
We study the greedy-based online algorithm for edge-weighted matching with (one-sided) vertex arrivals in bipartite graphs, and edge arrivals in general graphs. This algorithm was first studied more than a decade ago by Korula and Pál for the bipartite case in the random-order model. While the weighted bipartite matching problem is solved in the random-order model, this is not the case in recent and exciting online models in which the online player is provided with a sample, and the arrival order is adversarial. The greedy-based algorithm is arguably the most natural and practical algorithm to be applied in these models. Despite its simplicity and appeal, and despite being studied in multiple works, the greedy-based algorithm was not fully understood in any of the studied online models, and its actual performance remained an open question for more than a decade. We provide a thorough analysis of the greedy-based algorithm in several online models. For vertex arrivals in bipartite graphs, we characterize the exact competitive-ratio of this algorithm in the random-order model, for any arrival order of the vertices subsequent to the sampling phase (adversarial and random orders in particular). We use it to derive tight analysis in the recent adversarial-order model with a sample (AOS model) for any sample size, providing the first result in this model beyond the simple secretary problem. Then, we generalize and strengthen the black box method of converting results in the random-order model to single-sample prophet inequalities, and use it to derive the state-of-the-art single-sample prophet inequality for the problem. Finally, we use our new techniques to analyze the greedy-based algorithm for edge arrivals in general graphs and derive results in all the mentioned online models. In this case as well, we improve upon the state-of-the-art single-sample prophet inequality.
A $(\phi,\epsilon)$-expander-decomposition of a graph $G$ (with $n$ vertices and $m$ edges) is a partition of $V$ into clusters $V_1,\ldots,V_k$ with conductance $\Phi(G[V_i]) \ge \phi$, such that there are at most $\epsilon m$ inter-cluster edges. Such a decomposition plays a crucial role in many graph algorithms. We give a randomized $\tilde{O}(m/\phi)$ time algorithm for computing a $(\phi, \phi\log^2 {n})$-expander decomposition. This improves upon the $(\phi, \phi\log^3 {n})$-expander decomposition also obtained in $\tilde{O}(m/\phi)$ time by [Saranurak and Wang, SODA 2019] (SW) and brings the number of inter-cluster edges within a logarithmic factor of optimal. One crucial component of SW's algorithm is a non-stop version of the cut-matching game of [Khandekar, Rao, Vazirani, JACM 2009] (KRV): The cut player does not stop when it gets from the matching player an unbalanced sparse cut, but continues to play on a trimmed part of the large side. The crux of our improvement is the design of a non-stop version of the cleverer cut player of [Orecchia, Schulman, Vazirani, Vishnoi, STOC 2008] (OSVV). The cut player of OSVV uses a more sophisticated random walk, a subtle potential function, and spectral arguments. Designing and analysing a non-stop version of this game was an explicit open question asked by SW.
We study the online facility location problem with uniform facility costs in the random-order model. Meyerson's algorithm [FOCS'01] is arguably the most natural and simple online algorithm for the problem with several advantages and appealing properties. Its analysis in the random-order model is one of the cornerstones of random-order analysis beyond the secretary problem. Meyerson's algorithm was shown to be (asymptotically) optimal in the standard worst-case adversarial-order model and $8$-competitive in the random order model. While this bound in the random-order model is the long-standing state-of-the-art, it is not known to be tight, and the true competitive-ratio of Meyerson's algorithm remained an open question for more than two decades. We resolve this question and prove tight bounds on the competitive-ratio of Meyerson's algorithm in the random-order model, showing that it is exactly $4$-competitive. Following our tight analysis, we introduce a generic parameterized version of Meyerson's algorithm that retains all the advantages of the original version. We show that the best algorithm in this family is exactly $3$-competitive. On the other hand, we show that no online algorithm for this problem can achieve a competitive-ratio better than $2$. Finally, we prove that the algorithms in this family are robust to partial adversarial arrival orders.
The amount of training-data is one of the key factors which determines the generalization capacity of learning algorithms. Intuitively, one expects the error rate to decrease as the amount of training-data increases. Perhaps surprisingly, natural attempts to formalize this intuition give rise to interesting and challenging mathematical questions. For example, in their classical book on pattern recognition, Devroye, Gyorfi, and Lugosi (1996) ask whether there exists a monotone Bayes-consistent algorithm. This question remained open for over 25 years, until recently Pestov (2021) resolved it for binary classification, using an intricate construction of a monotone Bayes-consistent algorithm. We derive a general result in multiclass classification, showing that every learning algorithm A can be transformed to a monotone one with similar performance. Further, the transformation is efficient and only uses black-box oracle access to A. This demonstrates that one can provably avoid non-monotonic behaviour without compromising performance, thus answering questions asked by Devroye et al (1996), Viering, Mey, and Loog (2019), Viering and Loog (2021), and by Mhammedi (2021). Our transformation readily implies monotone learners in a variety of contexts: for example it extends Pestov's result to classification tasks with an arbitrary number of labels. This is in contrast with Pestov's work which is tailored to binary classification. In addition, we provide uniform bounds on the error of the monotone algorithm. This makes our transformation applicable in distribution-free settings. For example, in PAC learning it implies that every learnable class admits a monotone PAC learner. This resolves questions by Viering, Mey, and Loog (2019); Viering and Loog (2021); Mhammedi (2021).
Search trees on trees (STTs) generalize the fundamental binary search tree (BST) data structure: in STTs the underlying search space is an arbitrary tree, whereas in BSTs it is a path. An optimal BST of size $n$ can be computed for a given distribution of queries in $O(n^2)$ time [Knuth 1971] and centroid BSTs provide a nearly-optimal alternative, computable in $O(n)$ time [Mehlhorn 1977]. By contrast, optimal STTs are not known to be computable in polynomial time, and the fastest constant-approximation algorithm runs in $O(n^3)$ time [Berendsohn, Kozma 2022]. Centroid trees can be defined for STTs analogously to BSTs, and they have been used in a wide range of algorithmic applications. In the unweighted case (i.e., for a uniform distribution of queries), a centroid tree can be computed in $O(n)$ time [Brodal et al. 2001; Della Giustina et al. 2019]. These algorithms, however, do not readily extend to the weighted case. Moreover, no approximation guarantees were previously known for centroid trees in either the unweighted or weighted cases. In this paper we revisit centroid trees in a general, weighted setting, and we settle both the algorithmic complexity of constructing them, and the quality of their approximation. For constructing a weighted centroid tree, we give an output-sensitive $O(n\log h)\subseteq O(n\log n)$ time algorithm, where $h$ is the height of the resulting centroid tree. If the weights are of polynomial complexity, the running time is $O(n\log\log n)$. We show these bounds to be optimal, in a general decision tree model of computation. For approximation, we prove that the cost of a centroid tree is at most twice the optimum, and this guarantee is best possible, both in the weighted and unweighted cases. We also give tight, fine-grained bounds on the approximation-ratio for bounded-degree trees and on the approximation-ratio of more general $\alpha$-centroid trees.
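For concreteness, here is a naive quadratic-time sketch of a weighted centroid (a vertex minimizing the weight of the heaviest component left after its removal) and of the recursive construction of a centroid tree. The representation (adjacency dict, weight dict) is an assumption, and the paper's output-sensitive O(n log h) algorithm is far more efficient than this sketch.

```python
def weighted_centroid(adj, weight, nodes):
    """Return a vertex of the induced subtree `nodes` that minimizes the total
    weight of the heaviest component remaining after its removal."""
    nodes = set(nodes)
    best, best_load = None, float("inf")
    for v in nodes:
        heaviest, seen = 0, {v}
        for u in adj[v]:
            if u in nodes and u not in seen:
                comp, stack = 0, [u]        # explore one component of nodes - {v}
                seen.add(u)
                while stack:
                    x = stack.pop()
                    comp += weight[x]
                    for y in adj[x]:
                        if y in nodes and y not in seen:
                            seen.add(y)
                            stack.append(y)
                heaviest = max(heaviest, comp)
        if heaviest < best_load:
            best, best_load = v, heaviest
    return best

def centroid_tree(adj, weight, nodes=None):
    """Naive recursion: root at a weighted centroid, recurse on each remaining
    component.  Returns nested (root, [subtrees]) tuples."""
    if nodes is None:
        nodes = set(adj)
    if not nodes:
        return None
    c = weighted_centroid(adj, weight, nodes)
    remaining, seen, children = set(nodes) - {c}, set(), []
    for u in adj[c]:
        if u in remaining and u not in seen:
            comp, stack = [], [u]
            seen.add(u)
            while stack:
                x = stack.pop()
                comp.append(x)
                for y in adj[x]:
                    if y in remaining and y not in seen:
                        seen.add(y)
                        stack.append(y)
            children.append(centroid_tree(adj, weight, comp))
    return (c, children)
```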
We construct a universally Bayes consistent learning rule that satisfies differential privacy (DP). We first handle the setting of binary classification and then extend our rule to the more general setting of density estimation (with respect to the total variation metric). The existence of a universally consistent DP learner reveals a stark difference with the distribution-free PAC model. Indeed, in the latter DP learning is extremely limited: even one-dimensional linear classifiers are not privately learnable in this stringent model. Our result thus demonstrates that by allowing the learning rate to depend on the target distribution, one can circumvent the above-mentioned impossibility result and in fact, learn \emph{arbitrary} distributions by a single DP algorithm. As an application, we prove that any VC class can be privately learned in a semi-supervised setting with a near-optimal \emph{labeled} sample complexity of $\tilde{O}(d/\varepsilon)$ labeled examples (and with an unlabeled sample complexity that can depend on the target distribution).
Traffic splitting is a required functionality in networks, for example for load balancing over paths or servers, or according to the sources' access restrictions. The capacities of the servers (or the number of users with particular access restrictions) determine the sizes of the parts into which traffic should be split. A recent approach implements traffic splitting within the ternary content addressable memory (TCAM), which is often available in switches. It is important to reduce the amount of memory allocated for this task since TCAMs are power consuming and are often also required for other tasks such as classification and routing. Recent works suggested algorithms to compute a smallest implementation of a given partition in the longest prefix match (LPM) model. In this paper we analyze properties of such minimal representations and prove lower and upper bounds on their size. The upper bounds hold for general TCAMs, and we also prove an additional lower bound for general TCAMs. We also analyze the expected size of a representation, for uniformly random ordered partitions. We show that the expected representation size of a random partition is at least half the size for the worst-case partition, and is linear in the number of parts and in the logarithm of the size of the address space.
The literature on data sanitization aims to design algorithms that take an input dataset and produce a privacy-preserving version of it, that captures some of its statistical properties. In this note we study this question from a streaming perspective and our goal is to sanitize a data stream. Specifically, we consider low-memory algorithms that operate on a data stream and produce an alternative privacy-preserving stream that captures some statistical properties of the original input stream.
Differentially private algorithms for common metric aggregation tasks, such as clustering or averaging, often have limited practicality due to their complexity or a large number of data points that is required for accurate results. We propose a simple and practical tool $\mathsf{FriendlyCore}$ that takes a set of points ${\cal D}$ from an unrestricted (pseudo) metric space as input. When ${\cal D}$ has effective diameter $r$, $\mathsf{FriendlyCore}$ returns a stable subset ${\cal D}_G\subseteq {\cal D}$ that includes all points, except possibly few outliers, and is {\em certified} to have diameter $r$. $\mathsf{FriendlyCore}$ can be used to preprocess the input before privately aggregating it, potentially simplifying the aggregation or boosting its accuracy. Surprisingly, $\mathsf{FriendlyCore}$ is light-weight with no dependence on the dimension. We empirically demonstrate its advantages in boosting the accuracy of mean estimation, outperforming tailored methods.
We give an $(\varepsilon,\delta)$-differentially private algorithm for the multi-armed bandit (MAB) problem in the shuffle model with a distribution-dependent regret of $O\left(\left(\sum_{a\in [k]:\Delta_a>0}\frac{\log T}{\Delta_a}\right)+\frac{k\sqrt{\log\frac{1}{\delta}}\log T}{\varepsilon}\right)$, and a distribution-independent regret of $O\left(\sqrt{kT\log T}+\frac{k\sqrt{\log\frac{1}{\delta}}\log T}{\varepsilon}\right)$, where $T$ is the number of rounds, $\Delta_a$ is the suboptimality gap of the arm $a$, and $k$ is the total number of arms. Our upper bound almost matches the regret of the best known algorithms for the centralized model, and significantly outperforms the best known algorithm in the local model.
We study the classical online bipartite matching problem: One side of the graph is known and vertices of the other side arrive online. It is well known that when the graph is edge-weighted, and vertices arrive in an adversarial order, no online algorithm has a nontrivial competitive-ratio. To bypass this hurdle we modify the rules such that the adversary still picks the graph but has to reveal a random part (say half) of it to the player. The remaining part is given to the player in an adversarial order. This models practical scenarios in which the online algorithm has some history to learn from. This way of modeling a history was formalized recently by the authors (SODA 20) and was called the AOS model (for Adversarial Online with a Sample). It allows developing online algorithms for the secretary problem that compete even when the secretaries arrive in an adversarial order. Here we use the same model to attack the much more challenging matching problem. We analyze a natural algorithmic framework that decides how to match an arriving vertex $v$ by applying an offline matching algorithm to $v$ and the sample. We get roughly $1/4$ of the maximum weight by applying the offline greedy matching algorithm to the sample and $v$. Our analysis ties the performance of this algorithm to the performance of the offline greedy matching on the online part and we also prove that it is tight. Surprisingly, when replacing greedy with an optimal algorithm for maximum matching, no constant competitive-ratio can be guaranteed when the size of the sample is comparable to the size of the online part. However, when the sample is quadratic in the size of the online part, we do get a competitive-ratio of $1/e$.
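A sketch of the framework described above, under illustrative interface assumptions: edges are (weight, online_vertex, offline_vertex) triples, the offline algorithm is the greedy matching, and an arriving vertex is matched by running offline greedy on its edges together with the sample.

```python
def offline_greedy(edges):
    """Offline greedy matching: scan edges by decreasing weight and keep an edge
    if both endpoints are still free.  Returns a dict online_vertex -> offline_vertex."""
    matched_on, matched_off, match = set(), set(), {}
    for w, online, offline in sorted(edges, key=lambda e: e[0], reverse=True):
        if online not in matched_on and offline not in matched_off:
            matched_on.add(online)
            matched_off.add(offline)
            match[online] = offline
    return match

def greedy_with_sample(sample_edges, online_arrivals):
    """For each arriving vertex v (with its edge list [(weight, offline_vertex), ...]),
    run offline greedy on the sample plus v's edges, and commit to v's proposed match
    if that offline vertex has not been taken by an earlier real arrival."""
    taken, result = set(), {}
    for v, v_edges in online_arrivals:
        instance = list(sample_edges) + [(w, v, u) for w, u in v_edges]
        u = offline_greedy(instance).get(v)
        if u is not None and u not in taken:
            taken.add(u)
            result[v] = u
    return result
```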
We study the greedy-based online algorithm for edge-weighted matching with (one-sided) vertex arrivals in bipartite graphs, and edge arrivals in general graphs. This algorithm was first studied more than a decade ago by Korula and Pal for the bipartite case in the random-order model. While the weighted bipartite matching problem is solved in the random-order model, this is not the case in recent and exciting online models in which the online player is provided with a sample, and the arrival order is adversarial. The greedy-based algorithm is arguably the most natural and practical algorithm to be applied in these models. Despite its simplicity and appeal, and despite being studied in multiple works, the greedy-based algorithm was not fully understood in any of the studied online models, and its actual performance remained an open question for more than a decade. We provide a thorough analysis of the greedy-based algorithm in several online models. For vertex arrivals in bipartite graphs, we characterize the exact competitive-ratio of this algorithm in the random-order model, for any arrival order of the vertices subsequent to the sampling phase (adversarial and random orders in particular). We use it to derive tight analysis in the recent adversarial-order model with a sample (AOS model) for any sample size, providing the first result in this model beyond the simple secretary problem. Then, we generalize and strengthen the black box method of converting results in the random-order model to single-sample prophet inequalities, and use it to derive the state-of-the-art single-sample prophet inequality for the problem. Finally, we use our new techniques to analyze the greedy-based algorithm for edge arrivals in general graphs and derive results in all the mentioned online models. In this case as well, we improve upon the state-of-the-art single-sample prophet inequality.
Locality Sensitive Hashing (LSH) is an effective method of indexing a set of items to support efficient nearest neighbors queries in high-dimensional spaces. The basic idea of LSH is that similar items should produce hash collisions with higher probability than dissimilar items. We study LSH for (not necessarily convex) polygons, and use it to give efficient data structures for similar shape retrieval. Arkin et al. represent polygons by their "turning function" - a function which follows the angle between the polygon's tangent and the $x$-axis while traversing the perimeter of the polygon. They define the distance between polygons to be variations of the $L_p$ (for $p=1,2$) distance between their turning functions. This metric is invariant under translation, rotation and scaling (and the selection of the initial point on the perimeter) and therefore models well the intuitive notion of shape resemblance. We develop and analyze LSH near neighbor data structures for several variations of the $L_p$ distance for functions (for $p=1,2$). By applying our schemes to the turning functions of a collection of polygons we obtain efficient near neighbor LSH-based structures for polygons. To tune our structures to turning functions of polygons, we prove some new properties of these turning functions that may be of independent interest. As part of our analysis, we address the following problem, which is of independent interest: Find the vertical translation of a function $f$ that is closest in $L_1$ distance to a function $g$. We prove tight bounds on the approximation guarantee obtained by the translation which is equal to the difference between the averages of $g$ and $f$.
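A sketch of the turning-function representation referred to above: the angle of the current edge, accumulated over signed turns, as a step function of normalized arc length. The exact normalization and the choice of starting vertex are assumptions.

```python
import math

def turning_function(polygon):
    """Turning function of a (not necessarily convex) polygon, given as a list of
    (x, y) vertices in order.  Returns (arc_length_fraction, angle) breakpoints of
    a piecewise-constant function on [0, 1)."""
    n = len(polygon)
    edges = [(polygon[(i + 1) % n][0] - polygon[i][0],
              polygon[(i + 1) % n][1] - polygon[i][1]) for i in range(n)]
    lengths = [math.hypot(dx, dy) for dx, dy in edges]
    perimeter = sum(lengths)

    angle = math.atan2(edges[0][1], edges[0][0])     # angle of the first edge
    breakpoints = [(0.0, angle)]
    s = 0.0
    for i in range(1, n):
        s += lengths[i - 1]
        prev, cur = edges[i - 1], edges[i]
        # signed turn from edge i-1 to edge i, in (-pi, pi]
        turn = math.atan2(prev[0] * cur[1] - prev[1] * cur[0],
                          prev[0] * cur[0] + prev[1] * cur[1])
        angle += turn
        breakpoints.append((s / perimeter, angle))
    return breakpoints
```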
We present a streaming problem for which every adversarially-robust streaming algorithm must use polynomial space, while there exists a classical (oblivious) streaming algorithm that uses only polylogarithmic space. This is the first separation between oblivious streaming and adversarially-robust streaming, and resolves one of the central open questions in adversarially robust streaming.
We study a novel variant of online finite-horizon Markov Decision Processes with adversarially changing loss functions and initially unknown dynamics. In each episode, the learner suffers the loss accumulated along the trajectory realized by the policy chosen for the episode, and observes aggregate bandit feedback: the trajectory is revealed along with the cumulative loss suffered, rather than the individual losses encountered along the trajectory. Our main result is a computationally efficient algorithm with $O(\sqrt{K})$ regret for this setting, where $K$ is the number of episodes. We establish this result via an efficient reduction to a novel bandit learning setting we call Distorted Linear Bandits (DLB), which is a variant of bandit linear optimization where actions chosen by the learner are adversarially distorted before they are committed. We then develop a computationally-efficient online algorithm for DLB for which we prove an $O(\sqrt{T})$ regret bound, where $T$ is the number of time steps. Our algorithm is based on online mirror descent with a self-concordant barrier regularization that employs a novel increasing learning rate schedule.
We present an explicit and efficient construction of additively weighted Voronoi diagrams on planar graphs. Let $G$ be a planar graph with $n$ vertices and $b$ sites that lie on a constant number of faces. We show how to preprocess $G$ in $\tilde O(nb^2)$ time so that one can compute any additively weighted Voronoi diagram for these sites in $\tilde O(b)$ time. We use this construction to compute the diameter of a directed planar graph with real arc lengths in $\tilde{O}(n^{5/3})$ time. This improves the recent breakthrough result of Cabello [SODA 2017, SIAM, Philadelphia, 2017, pp. 2143--2152], both by improving the running time (from $\tilde{O}(n^{11/6})$), and by providing a deterministic algorithm. It is in fact the first truly subquadratic deterministic algorithm for this problem. Our use of Voronoi diagrams to compute the diameter follows that of Cabello, but he used abstract Voronoi diagrams, which makes his diameter algorithm more involved, more expensive, and randomized. As in Cabello's work, our algorithm can compute, for every vertex $v$, both the farthest vertex from $v$ (i.e., the eccentricity of $v$), and the sum of distances from $v$ to all other vertices. Hence, our algorithm can also compute the radius, median, and Wiener index (sum of all pairwise distances) of a planar graph within the same time bounds. Our construction of Voronoi diagrams for planar graphs is of independent interest.
In this work we study the problem of differentially private (DP) quantiles, in which given a dataset $X$ and quantiles $q_1, ..., q_m \in [0,1]$, we want to output $m$ quantile estimations which are as close as possible to the true quantiles and preserve DP. We describe a simple recursive DP algorithm, which we call ApproximateQuantiles (AQ), for this task. We give a worst-case upper bound on its error, and show that its error is much lower than that of previous implementations on several different datasets. Furthermore, it achieves this low error while running two orders of magnitude faster than the best previous implementation.
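For background, here is a standard exponential-mechanism sketch for releasing a single quantile privately: sample an interval between consecutive data points with probability proportional to its length times an exponential in a rank-based utility. This is a common building block for the problem, not the paper's ApproximateQuantiles algorithm, and constants in the privacy analysis are elided.

```python
import math
import random

def dp_quantile(data, q, eps, lo, hi):
    """Roughly eps-DP estimate of the q-quantile of `data`, assumed to lie in [lo, hi].

    The domain splits into gaps between consecutive (clamped, sorted) data points;
    every output in gap i has exactly i data points below it, so the utility
    -|i - q*n| has sensitivity 1 under swapping one data point.
    """
    x = sorted(min(max(v, lo), hi) for v in data)
    n = len(x)
    target = q * n
    bounds = [lo] + x + [hi]
    weights = []
    for i in range(n + 1):
        length = max(bounds[i + 1] - bounds[i], 0.0)
        utility = -abs(i - target)
        weights.append(length * math.exp(eps * utility / 2.0))
    total = sum(weights)
    r = random.random() * total
    for i, w in enumerate(weights):
        if r <= w:
            return random.uniform(bounds[i], bounds[i + 1])
        r -= w
    return random.uniform(bounds[n], bounds[n + 1])
```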
Traffic splitting is a required functionality in networks, for example for load balancing over multiple paths or among different servers. The capacities of the servers determine the partition by which traffic should be split. A recent approach implements traffic splitting within the ternary content addressable memory (TCAM), which is often available in switches. It is important to reduce the amount of memory allocated for this task since TCAMs are power consuming and are often also required for other tasks such as classification and routing. Previous work showed how to compute the smallest prefix-matching TCAM necessary to implement a given partition exactly. In this paper we solve the more practical case, where at most $n$ prefix-matching TCAM rules are available, restricting the ability to implement exactly the desired partition. We give simple and efficient algorithms to find $n$ rules that generate a partition closest in $L_\infty$ to the desired one. We do the same for a one-sided version of $L_\infty$ which equals the maximum overload on a server, and for a relative version of it. We use our algorithms to evaluate how the expected error changes as a function of the number of rules, the number of servers, and the width of the TCAM.
Clustering is a fundamental problem in data analysis. In differentially private clustering, the goal is to identify $k$ cluster centers without disclosing information on individual data points. Despite significant research progress, the problem had so far resisted practical solutions. In this work we aim at providing simple implementable differentially private clustering algorithms that provide utility when the data is "easy," e.g., when there exists a significant separation between the clusters. We propose a framework that allows us to apply non-private clustering algorithms to the easy instances and privately combine the results. We are able to get improved sample complexity bounds in some cases of Gaussian mixtures and $k$-means. We complement our theoretical analysis with an empirical evaluation on synthetic data.
A dynamic algorithm against an adaptive adversary is required to be correct when the adversary chooses the next update after seeing the previous outputs of the algorithm. We obtain faster dynamic algorithms against an adaptive adversary and separation results between what is achievable in the oblivious vs. adaptive settings. To get these results we exploit techniques from differential privacy, cryptography, and adaptive data analysis. We give a general reduction transforming a dynamic algorithm against an oblivious adversary to a dynamic algorithm robust against an adaptive adversary. This reduction maintains several copies of the oblivious algorithm and uses differential privacy to protect their random bits. Using this reduction we obtain dynamic algorithms against an adaptive adversary with improved update and query times for global minimum cut, all pairs distances, and all pairs effective resistance. We further improve our update and query times by showing how to maintain a sparsifier over an expander decomposition that can be refreshed fast. This fast refresh enables it to be robust against what we call a blinking adversary that can observe the output of the algorithm only following refreshes. We believe that these techniques will prove useful for additional problems. On the flip side, we specify dynamic problems for which, assuming a random oracle, every dynamic algorithm that solves them against an adaptive adversary must be polynomially slower than a rather straightforward dynamic algorithm that solves them against an oblivious adversary. We first show a separation result for a search problem and then show a separation result for an estimation problem. In the latter case our separation result draws from lower bounds in adaptive data analysis.
Differentially private algorithms for common metric aggregation tasks, such as clustering or averaging, often have limited practicality due to their complexity or to the large number of data points that is required for accurate results. We propose a simple and practical tool, $\mathsf{FriendlyCore}$, that takes a set of points ${\cal D}$ from an unrestricted (pseudo) metric space as input. When ${\cal D}$ has effective diameter $r$, $\mathsf{FriendlyCore}$ returns a "stable" subset ${\cal C} \subseteq {\cal D}$ that includes all points, except possibly few outliers, and is {\em certified} to have diameter $r$. $\mathsf{FriendlyCore}$ can be used to preprocess the input before privately aggregating it, potentially simplifying the aggregation or boosting its accuracy. Surprisingly, $\mathsf{FriendlyCore}$ is light-weight with no dependence on the dimension. We empirically demonstrate its advantages in boosting the accuracy of mean estimation and clustering tasks such as $k$-means and $k$-GMM, outperforming tailored methods.
We give an $(\varepsilon,\delta)$-differentially private algorithm for the multi-armed bandit (MAB) problem in the shuffle model with a distribution-dependent regret of $O\left(\left(\sum_{a\in [k]:\Delta_a>0}\frac{\log T}{\Delta_a}\right)+\frac{k\sqrt{\log\frac{1}{\delta}}\log T}{\varepsilon}\right)$, and a distribution-independent regret of $O\left(\sqrt{kT\log T}+\frac{k\sqrt{\log\frac{1}{\delta}}\log T}{\varepsilon}\right)$, where $T$ is the number of rounds, $\Delta_a$ is the suboptimality gap of the arm $a$, and $k$ is the total number of arms. Our upper bound almost matches the regret of the best known algorithms for the centralized model, and significantly outperforms the best known algorithm in the local model.
We study the greedy-based online algorithm for edge-weighted matching with (one-sided) vertex arrivals in bipartite graphs, and edge arrivals in general graphs. This algorithm was first studied more than a decade ago by Korula and P\'al for the bipartite case in the random-order model. While the weighted bipartite matching problem is solved in the random-order model, this is not the case in recent and exciting online models in which the online player is provided with a sample, and the arrival order is adversarial. The greedy-based algorithm is arguably the most natural and practical algorithm to be applied in these models. Despite its simplicity and appeal, and despite being studied in multiple works, the greedy-based algorithm was not fully understood in any of the studied online models, and its actual performance remained an open question for more than a decade. We provide a thorough analysis of the greedy-based algorithm in several online models. For vertex arrivals in bipartite graphs, we characterize the exact competitive-ratio of this algorithm in the random-order model, for any arrival order of the vertices subsequent to the sampling phase (adversarial and random orders in particular). We use it to derive tight analysis in the recent adversarial-order model with a sample (AOS model) for any sample size, providing the first result in this model beyond the simple secretary problem. Then, we generalize and strengthen the black box method of converting results in the random-order model to single-sample prophet inequalities, and use it to derive the state-of-the-art single-sample prophet inequality for the problem. Finally, we use our new techniques to analyze the greedy-based algorithm for edge arrivals in general graphs and derive results in all the mentioned online models. In this case as well, we improve upon the state-of-the-art single-sample prophet inequality.
The literature on data sanitization aims to design algorithms that take an input dataset and produce a privacy-preserving version of it that captures some of its statistical properties. In this note we study this question from a streaming perspective and our goal is to sanitize a data stream. Specifically, we consider low-memory algorithms that operate on a data stream and produce an alternative privacy-preserving stream that captures some statistical properties of the original input stream.
We describe a new data structure for dynamic nearest neighbor queries in the plane with respect to a general family of distance functions. These include $L_p$-norms and additively weighted Euclidean distances. Our data structure supports general (convex, pairwise disjoint) sites that have constant description complexity (e.g., points, line segments, disks, etc.). Our structure uses $O(n \log^3 n)$ storage, and requires polylogarithmic update and query time, improving an earlier data structure of Agarwal, Efrat and Sharir that required $O(n^\varepsilon)$ time for an update and $O(\log n)$ time for a query [SICOMP, 1999]. Our data structure has numerous applications. In all of them, it gives faster algorithms, typically reducing an $O(n^\varepsilon)$ factor in the previous bounds to polylogarithmic. In addition, we give here two new applications: an efficient construction of a spanner in a disk intersection graph, and a data structure for efficient connectivity queries in a dynamic disk graph.
Random sampling is used for several new geometric algorithms. The algorithms are "Las Vegas," and their expected bounds are with respect to the random behavior of the algorithms. One algorithm reports all the intersecting pairs of a set of line segments in the plane, and requires $O(A + n \log n)$ expected time, where $A$ is the size of the answer, the number of intersecting pairs reported. The algorithm requires $O(n)$ space in the worst case. Another algorithm computes the convex hull of a point set in $E^3$ in $O(n \log A)$ expected time, where $n$ is the number of points and $A$ is the number of points on the surface of the hull. A simple Las Vegas algorithm triangulates simple polygons in $O(n \log \log n)$ expected time. Algorithms for half-space range reporting are also given. In addition, this paper gives asymptotically tight bounds for a combinatorial quantity of interest in discrete and computational geometry, related to halfspace partitions of point sets.
We prove new upper and lower bounds on the sample complexity of (ε, δ) differentially private algorithms for releasing approximate answers to threshold functions. A threshold function $c_x$ over a totally ordered domain $X$ evaluates to $c_x(y) = 1$ if $y \le x$, and evaluates to 0 otherwise. We give the first nontrivial lower bound for releasing thresholds with (ε, δ) differential privacy, showing that the task is impossible over an infinite domain $X$, and moreover requires sample complexity $n \ge \Omega(\log^* |X|)$, which grows with the size of the domain. Inspired by the techniques used to prove this lower bound, we give an algorithm for releasing thresholds with $n \le 2^{(1+o(1))\log^* |X|}$ samples. This improves the previous best upper bound of $8^{(1+o(1))\log^* |X|}$ (Beimel et al., RANDOM '13). Our sample complexity upper and lower bounds also apply to the tasks of learning distributions with respect to Kolmogorov distance and of properly PAC learning thresholds with differential privacy. The lower bound gives the first separation between the sample complexity of properly learning a concept class with (ε, δ) differential privacy and learning without privacy. For properly learning thresholds in $\ell$ dimensions, this lower bound extends to $n \ge \Omega(\ell \cdot \log^* |X|)$. To obtain our results, we give reductions in both directions from releasing and properly learning thresholds and the simpler interior point problem. Given a database $D$ of elements from $X$, the interior point problem asks for an element between the smallest and largest elements in $D$. We introduce new recursive constructions for bounding the sample complexity of the interior point problem, as well as further reductions and techniques for proving impossibility results for other basic problems in differential privacy.
This beautiful discipline emerged from number theory after the fruitful observation made by Minkowski (1896) that many important results in Diophantine approximation (and in some other central fields of number theory) can be established by easy geometric arguments.
This paper is a direct continuation of the corresponding paper [7], in which our main results were formulated in Section 3. Some of these results were proved and we prepared ourselves for the proof of the asymptotic normality of the sample sum. Our main concern in this part is to carry through the remaining proofs.
Many datasets, including market basket data, text or hypertext documents, and events recorded in different locations or time periods, can be modeled as a collection of sets over a ground set of keys. Common queries over such data, including similarity or association rules, are represented as the weight or selectivity of keys that satisfy some selection predicate defined over keys' attributes and memberships in particular sets.
The best known upper bound on the number of topological changes in the Delaunay triangulation of a set of moving points in $\mathbb{R}^2$ is (nearly) cubic, even if each point is moving with a fixed velocity. We introduce the notion of a stable Delaunay graph (SDG in short), a dynamic subgraph of the Delaunay triangulation, that is less volatile in the sense that it undergoes fewer topological changes and yet retains many useful properties of the full Delaunay triangulation. SDG is defined in terms of a parameter $\alpha > 0$, and consists of Delaunay edges $pq$ for which the (equal) angles at which $p$ and $q$ see the corresponding Voronoi edge $e_{pq}$ are at least $\alpha$. We prove several interesting properties of SDG and describe two kinetic data structures for maintaining it. Both structures use $O^*(n)$ storage. They process $O^*(n^2)$ events during the motion, each in $O^*(1)$ time, provided that the points of $P$ move along algebraic trajectories of bounded degree; the $O^*(\cdot)$ notation hides multiplicative factors that are polynomial in $1/\alpha$ and polylogarithmic in $n$. The first structure is simpler but the dependency on $1/\alpha$ in its performance is higher.
In this work we analyze the sample complexity of classification by differentially private algorithms. Differential privacy is a strong and well-studied notion of privacy introduced by Dwork et al. [Lecture Notes in Comput. Sci. 3876, Springer, New York, 2006, pp. 265--284] that ensures that the output of an algorithm leaks little information about the data point provided by any of the participating individuals. Sample complexity of private probably approximately correct (PAC) and agnostic learning was studied in a number of prior works starting with Kasiviswanathan et al. [SIAM J. Comput., 40 (2011), pp. 793--826]. However, a number of basic questions remain open [A. Beimel, S. P. Kasiviswanathan, and K. Nissim, Lecture Notes in Comput. Sci. 5978, Springer, New York, 2006, pp. 437--454; K. Chaudhuri and D. Hsu, Proceedings of Conference in Learning Theory, 2011, pp. 155--186; A. Beimel, K. Nissim, and U. Stemmer, Proceedings of the 4th Conference on Innovations in Theoretical Computer Science, 2013, pp. 97--110; A. Beimel, K. Nissim, and U. Stemmer, Proceedings of APPROX+RANDOM, 2013, pp. 363--378], most notably whether learning with privacy requires more samples than learning without privacy. We show that the sample complexity of learning with (pure) differential privacy can be arbitrarily higher than the sample complexity of learning without the privacy constraint or the sample complexity of learning with approximate differential privacy. Our second contribution and the main tool is an equivalence between the sample complexity of (pure) differentially private learning of a concept class $C$ ($\mathrm{SCDP}(C)$) and the randomized one-way communication complexity of the evaluation problem for concepts from $C$. Using this equivalence we prove the following bounds: (a) $\mathrm{SCDP}(C) = \Omega(\mathrm{LDim}(C))$, where $\mathrm{LDim}(C)$ is the Littlestone's dimension characterizing the number of mistakes in the online-mistake-bound learning model [N. Littlestone, Machine Learning, 2 (1987), pp. 285--318]. Known bounds on $\mathrm{LDim}(C)$ then imply that $\mathrm{SCDP}(C)$ can be much higher than the Vapnik--Chervonenkis dimension of $C$. (b) For any $t$, there exists a class $C$ such that $\mathrm{LDim}(C)=2$ but $\mathrm{SCDP}(C) \geq t$. (c) For any $t$, there exists a class $C$ such that the sample complexity of (pure) $\alpha$-differentially private PAC learning is $\Omega(t/\alpha)$ but the sample complexity of the approximate $(\alpha,\beta)$-differentially private PAC learning is $O(\log(1/\beta)/\alpha)$. This resolves an open problem from Beimel, Nissim, and Stemmer [Proceedings of APPROX+RANDOM, 2013, pp. 363--378].
We study the sample complexity of learning threshold functions under the constraint of differential privacy. It is assumed that each labeled example in the training data is the information of one individual and we would like to come up with a generalizing hypothesis $h$ while guaranteeing differential privacy for the individuals. Intuitively, this means that any single labeled example in the training data should not have a significant effect on the choice of the hypothesis. This problem has received much attention recently; unlike the non-private case, where the sample complexity is independent of the domain size and just depends on the desired accuracy and confidence, for private learning the sample complexity must depend on the domain size $X$ (even for approximate differential privacy). Alon et al. (STOC 2019) showed a lower bound of $\Omega(\log^*|X|)$ on the sample complexity and Bun et al. (FOCS 2015) presented an approximate-private learner with sample complexity $\tilde{O}\left(2^{\log^*|X|}\right)$. In this work we reduce this gap significantly, almost settling the sample complexity. We first present a new upper bound (algorithm) of $\tilde{O}\left(\left(\log^*|X|\right)^2\right)$ on the sample complexity and then present an improved version with sample complexity $\tilde{O}\left(\left(\log^*|X|\right)^{1.5}\right)$. Our algorithm is constructed for the related interior point problem, where the goal is to find a point between the largest and smallest input elements. It is based on selecting an input-dependent hash function and using it to embed the database into a domain whose size is reduced logarithmically; this results in a new database, an interior point of which can be used to generate an interior point in the original database in a differentially private manner.
In this paper, we consider the dynamic Voronoi diagram problem. In this problem, a given set of planar points is moving and our objective is to find the Voronoi diagram of these moving points at any time $t$. A preprocessing algorithm and a query processing algorithm are presented in this paper. Assume that the points are in $k$-motion, and that it takes $O(k)$ time to find the roots of a polynomial of degree $O(k)$. The preprocessing algorithm takes $O(k^2 n^3 \log n \cdot 2^{O(\alpha(n)^{5k+1})})$ time to process the motion functions of the given points, and uses $O(k^2 n^3 \cdot 2^{O(\alpha(n)^{5k+1})})$ space to store the preprocessing result, where $\alpha(n)$ is the functional inverse of Ackermann's function. The query processing algorithm reports the Voronoi diagram of these points for a query time $t$. It takes $O(n)$ time, which is optimal.
A great deal of effort has been devoted to reducing the risk of spurious scientific discoveries, from the use of sophisticated validation techniques, to deep statistical methods for controlling the false discovery rate in multiple hypothesis testing. However, there is a fundamental disconnect between the theoretical results and the practice of data analysis: the theory of statistical inference assumes a fixed collection of hypotheses to be tested, or learning algorithms to be applied, selected non-adaptively before the data are gathered, whereas in practice data is shared and reused with hypotheses and new analyses being generated on the basis of data exploration and the outcomes of previous analyses.
This Gödel Prize-winning work traces the steps toward modeling real data.
Many data sources are naturally modeled by multiple weight assignments over a set of keys: snapshots of an evolving database at multiple points in time, measurements collected over multiple time periods, requests for resources served at multiple locations, and records with multiple numeric attributes. Over such vector-weighted data we are interested in aggregates with respect to one set of weights, such as weighted sums, and aggregates over multiple sets of weights such as the $L_1$ difference. Sample-based summarization is highly effective for data sets that are too large to be stored or manipulated. The summary facilitates approximate processing of queries that may be specified after the summary was generated. Current designs, however, are geared for data sets where a single scalar weight is associated with each key. We develop a sampling framework based on coordinated weighted samples that is suited for multiple weight assignments and obtain estimators that are orders of magnitude tighter than previously possible. We demonstrate the power of our methods through an extensive empirical evaluation on diverse data sets ranging from IP network to stock quotes data.
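A minimal sketch of the coordination idea, assuming the standard trick of attaching one shared uniform seed to each key: every weight assignment derives its Poisson PPS sample from the same seeds, so similar assignments yield overlapping samples. The function and parameter names are illustrative only, not the paper's interface.

```python
import random

def coordinated_pps_samples(weights_by_assignment, expected_size):
    """Hedged sketch of coordinated Poisson PPS sampling.

    weights_by_assignment: dict assignment -> dict key -> weight.
    Each key k gets a single shared uniform seed u(k); for assignment b with
    total weight W_b, key k is sampled iff u(k) <= expected_size * w_b(k) / W_b
    (capped at 1). Sharing the seeds across assignments is what makes the
    samples coordinated: a key sampled under one assignment is likely to be
    sampled under a similar one, which tightens difference estimators.
    """
    keys = set()
    for w in weights_by_assignment.values():
        keys.update(w)
    seed = {k: random.random() for k in keys}  # one shared seed per key

    samples = {}
    for b, w in weights_by_assignment.items():
        total = sum(w.values())
        samples[b] = {k: w[k] for k in w
                      if seed[k] <= min(1.0, expected_size * w[k] / total)}
    return samples
```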
We investigate the adversarial robustness of streaming algorithms. In this context, an algorithm is considered robust if its performance guarantees hold even if the stream is chosen adaptively by an adversary that observes the outputs of the algorithm along the stream and can react in an online manner. While deterministic streaming algorithms are inherently robust, many central problems in the streaming literature do not admit sublinear-space deterministic algorithms; on the other hand, classical space-efficient randomized algorithms for these problems are generally not adversarially robust. This raises the natural question of whether there exist efficient adversarially robust (randomized) streaming algorithms for these problems. In this work, we show that the answer is positive for various important streaming problems in the insertion-only model, including distinct elements and more generally $F_p$-estimation, $F_p$-heavy hitters, entropy estimation, and others. For all of these problems, we develop adversarially robust $(1+\varepsilon)$-approximation algorithms whose required space matches that of the best known non-robust algorithms up to a $\mathrm{poly}(\log n, 1/\varepsilon)$ multiplicative factor (and in some cases even up to a constant factor). Towards this end, we develop several generic tools allowing one to efficiently transform a non-robust streaming algorithm into a robust one in various scenarios.
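One generic robustification idea in this line of work is often described as sketch switching. The sketch below is a simplified illustration, assuming an insertion-only stream with a monotonically non-decreasing estimate (as in $F_p$-moment estimation); the estimator interface (update/estimate) and the copy-management details are assumptions of the sketch, not the paper's exact construction.

```python
class SketchSwitcher:
    """Hedged sketch of the "sketch switching" robustification idea.

    copies: independent instances of a non-robust streaming estimator, each
    supporting update(item) and estimate(). Every item is fed to all copies,
    but the published output changes only when the active copy's estimate
    has grown by a (1 + eps) factor, at which point a fresh copy becomes
    active. The number of copies needed is governed by the flip number of
    the stream; this is a simplified illustration only.
    """

    def __init__(self, copies, eps):
        self.copies = copies
        self.eps = eps
        self.active = 0
        self.published = None

    def update(self, item):
        for c in self.copies:
            c.update(item)
        current = self.copies[self.active].estimate()
        if self.published is None:
            self.published = current
        elif current > (1 + self.eps) * self.published:
            # Output drifted: reveal a fresh value and move to the next copy.
            self.active = min(self.active + 1, len(self.copies) - 1)
            self.published = self.copies[self.active].estimate()
        return self.published
```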
In this article, we show how to compute for $n$-vertex planar graphs in $O(n^{11/6}\,\mathrm{polylog}(n))$ expected time the diameter and the sum of the pairwise distances. The algorithms work for directed graphs with real weights and no negative cycles. In $O(n^{15/8}\,\mathrm{polylog}(n))$ expected time, we can also compute the number of pairs of vertices at distances smaller than a given threshold. These are the first algorithms for these problems using time $O(n^c)$ for some constant $c < 2$, even when restricted to undirected, unweighted planar graphs.
As organizations struggle with vast amounts of data, outsourcing sensitive data to third parties becomes a necessity. To protect the data, various cryptographic techniques are used in outsourced database systems to ensure data privacy, while allowing efficient querying. Recent attacks on such systems demonstrate that outsourced database systems must trade off efficiency and privacy. Towards designing systems that strike a good balance between these two aspects, we present a new model of differentially private outsourced database systems, where differential privacy is preserved at the record level even against an untrusted server that controls data and queries. Beginning with an atomic storage model where the server can observe both the memory access pattern and communication volume, we provide upper and lower bounds on the efficiency of differentially private outsourced database systems. Our lower bounds motivate the examination of models where the memory access pattern is kept hidden from the server. Combining oblivious RAM with differentially private sanitizers, we present a generic construction of differentially private outsourced databases. We have implemented our constructions and report on their efficiency.
In his seminal work, Cleve [STOC '86] proved that the bias of any coin-flipping protocol is inversely proportional to the number of rounds. This lower bound was met for the two-party case by Moran et al. [Journal of Cryptology '16], and the three-party case (up to a polylogarithmic factor) by Haitner and Tsfadia [SICOMP '17], and was approached for multi-party protocols by Haitner et al. [SODA '17] when the number of rounds is at least doubly exponential in the number of parties. In the remaining cases, however, the best known bias for multi-party coin-flipping protocols is proportional to the number of parties and inversely proportional to the square root of the number of rounds. The latter bias is achieved by the majority protocol of Awerbuch et al. [Manuscript '85]. Our main result is a tighter lower bound on the bias of coin-flipping protocols, showing that, if the number of rounds is bounded by some polynomial in the number of parties, then the bias is lower-bounded by a quantity that is inversely proportional to the square root of the number of rounds (up to a polylogarithmic factor). As far as we know, this is the first improvement of Cleve's bound, and it is within a factor of the number of parties of the aforementioned upper bound of Awerbuch et al. We prove the above bound using two new results that we believe are of independent interest. The first result is that a sequence of ("augmented") weak martingales has a large gap: with constant probability there exist two adjacent variables whose gap is at least the ratio between the gap between the first and last variables and the square root of the number of variables. This generalizes the result of Cleve and Impagliazzo [Manuscript '93], who showed that the above holds for strong martingales, and allows in some settings to exploit this gap by efficient algorithms. We prove the above using a novel argument that does not follow the more complicated approach of Cleve and Impagliazzo. The second result is a new sampling algorithm that uses a differentially private mechanism to minimize the effect of data divergence.
We present a general purpose unequal probability without replacement sampling plan with fixed sample size. In contrast to existing such plans, our scheme keeps the sample size fixed and lets the population units enter the sample one at a time through a carefully designed random mechanism. Consequently, all high-order inclusion probabilities can be easily computed.
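A hedged sketch of a sequential fixed-size scheme in this spirit, ignoring the corrections a full plan needs when some unit's weight is so large that its target inclusion probability would exceed one; the function name and interface are illustrative assumptions.

```python
import random

def sequential_weighted_sample(stream, n):
    """Hedged sketch of a sequential fixed-size unequal-probability sample.

    stream yields (unit, weight) pairs. The first n units fill the sample;
    each later unit with weight w is accepted with probability n * w / W,
    where W is the total weight seen so far, and on acceptance replaces a
    uniformly random current sample member. This keeps the sample size fixed
    at n and gives inclusion probabilities roughly proportional to size; it
    omits the corrections the actual plan applies to "overweight" units.
    """
    sample, total = [], 0.0
    for unit, w in stream:
        total += w
        if len(sample) < n:
            sample.append(unit)
        elif random.random() < min(1.0, n * w / total):
            sample[random.randrange(n)] = unit
    return sample
```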
Learning problems form an important category of computational tasks that generalizes many of the computations researchers apply to large real-life data sets. We ask, What concept classes can be learned privately, namely, by an algorithm whose output does not depend too heavily on any one input or specific training example? More precisely, we investigate learning algorithms that satisfy differential privacy, a notion that provides strong confidentiality guarantees in contexts where aggregate information is released about a database containing sensitive information about individuals. Our goal is a broad understanding of the resources required for private learning in terms of samples, computation time, and interaction. We demonstrate that, ignoring computational constraints, it is possible to privately agnostically learn any concept class using a sample size approximately logarithmic in the cardinality of the concept class. Therefore, almost anything learnable is learnable privately: specifically, if a concept class is learnable by a (nonprivate) algorithm with polynomial sample complexity and output size, then it can be learned privately using a polynomial number of samples. We also present a computationally efficient private probably approximately correct (PAC) learner for the class of parity functions. This result dispels the similarity between learning with noise and private learning (both must be robust to small changes in inputs), since parity is thought to be very hard to learn given random classification noise. Local (or randomized response) algorithms are a practical class of private algorithms that have received extensive investigation. We provide a precise characterization of local private learning algorithms. We show that a concept class is learnable by a local algorithm if and only if it is learnable in the statistical query (SQ) model. Therefore, for local private learning algorithms, the similarity to learning with noise is stronger: local learning is equivalent to SQ learning, and SQ algorithms include most known noise-tolerant learning algorithms. Finally, we present a separation between the power of interactive and noninteractive local learning algorithms. Because of the equivalence to SQ learning, this result also separates adaptive and nonadaptive SQ learning.
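A generic learner with sample complexity logarithmic in the cardinality of the concept class can be realized by exponential-mechanism selection over hypotheses scored by empirical accuracy; the following Python sketch illustrates that selection step under the assumption of a finite list of concepts, and is not claimed to match the paper's exact construction or constants.

```python
import math
import random

def exponential_mechanism_learner(concepts, data, eps):
    """Hedged sketch: a generic private learner via the exponential mechanism.

    concepts: a finite list of hypotheses, each a function x -> {0, 1};
    data: labeled examples (x, y); eps: the privacy parameter. Each
    hypothesis is scored by the number of examples it labels correctly
    (sensitivity 1 under changing one example), and one is sampled with
    probability proportional to exp(eps * score / 2).
    """
    scores = [sum(1 for x, y in data if h(x) == y) for h in concepts]
    top = max(scores)
    weights = [math.exp(eps * (s - top) / 2) for s in scores]  # stabilized
    r = random.random() * sum(weights)
    acc = 0.0
    for h, w in zip(concepts, weights):
        acc += w
        if r <= acc:
            return h
    return concepts[-1]
```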
In the adversarially robust streaming model, a stream of elements is presented to an algorithm and is allowed to depend on the output of the algorithm at earlier times during the stream. In the classic insertion-only model of data streams, Ben-Eliezer et al. (PODS 2020, best paper award) show how to convert a non-robust algorithm into a robust one with a roughly $1/\varepsilon$ factor overhead. This was subsequently improved to a $1/\sqrt{\varepsilon}$ factor overhead by Hassidim et al. (NeurIPS 2020, oral presentation), suppressing logarithmic factors. For general functions the latter is known to be best-possible, by a result of Kaplan et al. (CRYPTO 2021). We show how to bypass this impossibility result by developing data stream algorithms for a large class of streaming problems, with no overhead in the approximation factor. Our class of streaming problems includes the most well-studied problems such as the $L_2$-heavy hitters problem, $F_p$-moment estimation, as well as empirical entropy estimation. We substantially improve upon all prior work on these problems, giving the first optimal dependence on the approximation factor. As in previous work, we obtain a general transformation that applies to any non-robust streaming algorithm and depends on the so-called flip number. However, the key technical innovation is that we apply the transformation to what we call a difference estimator for the streaming problem, rather than an estimator for the streaming problem itself. We then develop the first difference estimators for a wide range of problems. Our difference estimator methodology is not only applicable to the adversarially robust model, but to other streaming models where temporal properties of the data play a central role. To demonstrate the generality of our technique, we additionally introduce a general framework for the related sliding window model of data streams and resolve longstanding open questions in that model, obtaining a drastic improvement from the previous $1/\varepsilon^{2+p}$ dependence for $F_p$-moment estimation for $p \in [1,2]$ and integer $p > 2$ of Braverman and Ostrovsky (FOCS, 2007), to the optimal $1/\varepsilon^{2}$ bound. We also improve the prior $1/\varepsilon^{3}$ bound for $p\in[0,1)$, and the prior $1/\varepsilon^{4}$ bound for empirical entropy, obtaining the first optimal $1/\varepsilon^{2}$ dependence for both of these problems as well. Qualitatively, our results show there is no separation between the sliding window model and the standard data stream model in terms of the approximation factor.
Propagation of contagion through networks is a fundamental process. It is used to model the spread of information, influence, or a viral infection. Diffusion patterns can be specified by a probabilistic model, such as Independent Cascade (IC), or captured by a set of representative traces.
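Since the Independent Cascade model is named here, a minimal Monte Carlo sketch of a single IC simulation may help; averaging the reached-set size over many runs estimates the expected spread of a seed set. The uniform per-edge activation probability is a simplifying assumption of the sketch.

```python
import random

def simulate_independent_cascade(graph, seeds, p):
    """Hedged sketch: one Monte Carlo run of the Independent Cascade model.

    graph: dict node -> list of out-neighbors; seeds: initially active nodes;
    p: activation probability per edge (could also vary per edge). Each newly
    active node gets one chance to activate each inactive out-neighbor.
    Returns the set of nodes reached.
    """
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        nxt = []
        for u in frontier:
            for v in graph.get(u, []):
                if v not in active and random.random() < p:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return active
```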
Motivated by applications to sensor, peer-to-peer, and ad-hoc networks, we study the problem of computing functions of values at the nodes in a network in a totally distributed manner. In particular, we consider separable functions, which can be written as linear combinations of functions of individual variables. Known iterative algorithms for averaging can be used to compute the normalized values of such functions, but these algorithms do not extend in general to the computation of the actual values of separable functions. The main contribution of this paper is the design of a distributed randomized algorithm for computing separable functions based on properties of exponential random variables. We bound the running time of our algorithm in terms of the running time of an information spreading algorithm used as a subroutine by the algorithm. Since we are interested in totally distributed algorithms, we consider a randomized gossip mechanism for information spreading as the subroutine. Combining these algorithms yields a complete and simple distributed algorithm for computing separable functions. The second contribution of this paper is an analysis of the information spreading time of the gossip algorithm. This analysis yields an upper bound on the information spreading time, and therefore a corresponding upper bound on the running time of the algorithm for computing separable functions, in terms of the conductance of an appropriate stochastic matrix. These bounds imply that, for a class of graphs with small spectral gap (such as grid graphs), the time used by our algorithm to compute averages is of a smaller order than the time required for the computation of averages by a known iterative gossip scheme [5].
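The probabilistic identity behind this approach is that the minimum of independent exponentials with rates $x_i$ is exponential with rate $\sum_i x_i$. The sketch below exercises this identity centrally; in the actual algorithm the minima are computed in a distributed fashion by gossip-based information spreading, and the exact estimator may differ.

```python
import random

def estimate_sum_via_exponentials(values, repetitions=1000):
    """Hedged sketch of the exponential-random-variable trick for sums.

    If each node i draws W_i ~ Exp(rate = x_i), then min_i W_i ~ Exp(rate =
    sum_i x_i). Repeating r times and applying the unbiased estimator
    (r - 1) / (sum of the r minima) recovers the sum. Assumes at least one
    strictly positive value.
    """
    total_min = 0.0
    for _ in range(repetitions):
        draws = [random.expovariate(x) for x in values if x > 0]
        total_min += min(draws)
    return (repetitions - 1) / total_min

# Example: estimate_sum_via_exponentials([1.0, 2.0, 3.0]) is close to 6.
```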
In 2008, Kasiviswanathan et al. defined private learning as a combination of PAC learning and differential privacy [16]. Informally, a private learner is applied to a collection of labeled individual information and outputs a hypothesis while preserving the privacy of each individual. Kasiviswanathan et al. gave a generic construction of private learners for (finite) concept classes, with sample complexity logarithmic in the size of the concept class. This sample complexity is higher than what is needed for non-private learners, hence leaving open the possibility that the sample complexity of private learning may be sometimes significantly higher than that of non-private learning. We give a combinatorial characterization of the sample size sufficient and necessary to privately learn a class of concepts. This characterization is analogous to the well known characterization of the sample complexity of non-private learning in terms of the VC dimension of the concept class. We introduce the notion of probabilistic representation of a concept class, and our new complexity measure RepDim corresponds to the size of the smallest probabilistic representation of the concept class. We show that any private learning algorithm for a concept class C with sample complexity m implies RepDim(C) = O(m), and that there exists a private learning algorithm with sample complexity m = O(RepDim(C)).
We introduce fast algorithms for selecting a random sample of $n$ records without replacement from a pool of $N$ records, where the value of $N$ is unknown beforehand. The main result of the paper is the design and analysis of Algorithm Z; it does the sampling in one pass using constant space and in $O(n(1 + \log(N/n)))$ expected time, which is optimum, up to a constant factor. Several optimizations are studied that collectively improve the speed of the naive version of the algorithm by an order of magnitude. We give an efficient Pascal-like implementation that incorporates these modifications and that is suitable for general use. Theoretical and empirical results indicate that Algorithm Z outperforms current methods by a significant margin.
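For orientation, the classical one-pass baseline (Algorithm R) that Algorithm Z accelerates looks as follows; Algorithm Z obtains the same sample distribution faster by drawing how many records to skip between insertions instead of deciding record by record.

```python
import random

def reservoir_sample(stream, n):
    """One-pass uniform sample of n records without replacement (Algorithm R).

    The t-th record (1-indexed) replaces a uniformly random reservoir slot
    with probability n / t. This baseline takes O(N) time; the skip-based
    approach in the paper reaches O(n (1 + log(N/n))) expected time while
    producing the same distribution.
    """
    reservoir = []
    for t, record in enumerate(stream, start=1):
        if t <= n:
            reservoir.append(record)
        else:
            j = random.randrange(t)  # uniform in {0, ..., t-1}
            if j < n:
                reservoir[j] = record
    return reservoir
```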
Given a triangulated planar graph $G$ on $n$ vertices and an integer $r < n$, an $r$-division of $G$ with few holes is a decomposition of $G$ into $O(n/r)$ regions of size at most $r$ such that each region contains at most a constant number of faces that are not faces of $G$ (also called holes), and such that, for each region, the total number of vertices on these faces is $O(\sqrt{r})$. We provide an algorithm for computing $r$-divisions with few holes in linear time. In fact, our algorithm computes a structure, called a decomposition tree, which represents a recursive decomposition of $G$ that includes $r$-divisions for essentially all values of $r$. In particular, given an exponentially increasing sequence $\vec{r} = (r_1, r_2, \ldots)$, our algorithm can produce a recursive $\vec{r}$-division with few holes in linear time.
We show that, under a standard hardness assumption, there is no computationally efficient algorithm that given $n$ samples from an unknown distribution can give valid answers to $n^{3+o(1)}$ adaptively chosen statistical queries. A statistical query asks for the expectation of a predicate over the underlying distribution, and an answer to a statistical query is valid if it is "close" to the correct expectation over the distribution. Our result stands in stark contrast to the well known fact that exponentially many statistical queries can be answered validly and efficiently if the queries are chosen non-adaptively (no query may depend on the answers to previous queries). Moreover, Dwork et al. [1] showed how to accurately answer exponentially many adaptively chosen statistical queries via a computationally inefficient algorithm. They also gave an efficient algorithm that can answer nearly $n^2$ adaptively chosen queries, which shows our result is almost quantitatively tight. Conceptually, our result demonstrates that achieving statistical validity alone can be a source of computational intractability in adaptive settings. For example, in the modern large collaborative research environment, data analysts typically choose a particular approach based on previous findings. False discovery occurs if a research finding is supported by the data but not by the underlying distribution. While the study of preventing false discovery in Statistics is decades old, to the best of our knowledge our result is the first to demonstrate a computational barrier. In particular, our result suggests that the perceived difficulty of preventing false discovery in today's collaborative research environment may be inherent.
We consider the problem of estimating the number of distinct values in a column of a table. For large tables without an index on the column, random sampling appears to be the only scalable approach for estimating the number of distinct values. We establish a powerful negative result stating that no estimator can guarantee small error across all input distributions, unless it examines a large fraction of the input data. In fact, any estimator must incur a significant error on at least some of a natural class of distributions. We then provide a new estimator which is provably optimal, in that its error is guaranteed to essentially match our negative result. A drawback of this estimator is that while its worst-case error is reasonable, it does not necessarily give the best possible error bound on any given distribution. Therefore, we develop heuristic estimators that are optimized for a class of typical input distributions. While these estimators lack strong guarantees on distribution-independent worst-case error, our extensive empirical comparison indicates their effectiveness both on real data sets and on synthetic data sets.
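To make the sampling setting concrete, here is a sketch of one sample-based estimator of this flavor, which scales up only the values seen exactly once in a sample of $r$ rows out of $n$ by $\sqrt{n/r}$. This follows the general shape of guaranteed-error estimators for the problem, but it is an illustration rather than the paper's provably optimal estimator.

```python
import math
from collections import Counter

def estimate_distinct_values(sample_rows, table_size):
    """Hedged sketch of a sample-based distinct-value estimator.

    sample_rows: the r sampled column values; table_size: total rows n.
    f[i] counts the distinct values seen exactly i times in the sample; the
    estimate scales up only the singletons:
        D_hat = sqrt(n / r) * f[1] + sum over i >= 2 of f[i].
    """
    r = len(sample_rows)
    counts = Counter(sample_rows)
    f = Counter(counts.values())  # f[i] = number of values seen exactly i times
    singletons = f.get(1, 0)
    rest = sum(cnt for i, cnt in f.items() if i >= 2)
    return math.sqrt(table_size / r) * singletons + rest
```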
which is the precision to which $F$ approximates $K$. An instance of classical interest is that in which $L$ is $C([0,1])$, $K$ is $A_\omega$, i.e., those functions bounded by one and whose modulus of continuity is dominated by the modulus-of-continuity function $\omega$, and $F$ is the family $P_n$ of polynomials of degree $n-1$. For this case it is known that there are positive constants $A$ and $B$, independent of $\omega$ and $n$, such that
Distance queries are a basic tool in data analysis. They are used for detection and localization of change for the purpose of anomaly detection, monitoring, or planning. Distance queries are particularly useful when data sets such as measurements, snapshots of a system, content, traffic matrices, and activity logs are collected repeatedly. Random sampling, which can be efficiently performed over streamed or distributed data, is an important tool for scalable data analysis. The sample constitutes an extremely flexible summary, which naturally supports domain queries and scalable estimation of statistics, which can be specified after the sample is generated. The effectiveness of a sample as a summary, however, hinges on the estimators we have. We derive novel estimators for estimating $L_p$ distance from sampled data. Our estimators apply with the most common weighted sampling schemes: Poisson Probability Proportional to Size (PPS) and its fixed sample size variants. They also apply when the samples of different data sets are independent or coordinated. Our estimators are admissible (Pareto optimal in terms of variance) and have compelling properties. We study the performance of our Manhattan and Euclidean distance ($p=1,2$) estimators on diverse datasets, demonstrating scalability and accuracy even when a small fraction of the data is sampled. Our work, for the first time, facilitates effective distance estimation over sampled data.
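As a baseline for contrast: if a single sample of keys with known inclusion probabilities allowed both weights of each sampled key to be observed, a plain Horvitz-Thompson (inverse-probability-weighted) sum would already be unbiased for the $L_1$ distance, as in the sketch below. The difficulty addressed by the paper is that each data set is sampled separately (independently or coordinated), so this simple estimator is not directly available and the derived estimators are more involved.

```python
def l1_distance_ht(sampled_keys, a, b, inclusion_prob):
    """Hedged baseline: Horvitz-Thompson estimate of the L1 distance.

    Assumes key k is included in a single sample with known probability
    inclusion_prob[k], and that both weights a[k] and b[k] are observable
    for sampled keys. Then
        sum over sampled k of |a[k] - b[k]| / inclusion_prob[k]
    is unbiased for sum over all k of |a[k] - b[k]|.
    """
    return sum(abs(a.get(k, 0.0) - b.get(k, 0.0)) / inclusion_prob[k]
               for k in sampled_keys)
```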