# Simons Investigator Award

2020
Doron, Dean, Jack Murtagh, Salil Vadhan, and David Zuckerman. Spectral sparsification via bounded-independence sampling. Electronic Colloquium on Computational Complexity (ECCC), TR20-026, 2020. Publisher's VersionAbstract

Version History:

v1, 26 Feb 2020: https://arxiv.org/abs/2002.11237

We give a deterministic, nearly logarithmic-space algorithm for mild spectral sparsification of undirected graphs. Given a weighted, undirected graph $$G$$ on $$n$$ vertices described by a binary string of length $$N$$, an integer $$k \leq \log n$$ and an error parameter $$\varepsilon > 0$$, our algorithm runs in space $$\tilde{O}(k \log(N ^. w_{max}/w_{min}))$$ where $$w_{max}$$ and $$w_{min}$$ are the maximum and minimum edge weights in $$G$$, and produces a weighted graph $$H$$ with $$\tilde{O}(n^{1+2/k} / \varepsilon^2)$$expected edges that spectrally approximates $$G$$, in the sense of Spielmen and Teng [ST04], up to an error of $$\varepsilon$$.

Our algorithm is based on a new bounded-independence analysis of Spielman and Srivastava's effective resistance based edge sampling algorithm [SS08] and uses results from recent work on space-bounded Laplacian solvers [MRSV17]. In particular, we demonstrate an inherent tradeoff (via upper and lower bounds) between the amount of (bounded) independence used in the edge sampling algorithm, denoted by $$k$$ above, and the resulting sparsity that can be achieved.

Chen, Yiling, Or Sheffet, and Salil Vadhan. “Privacy games.” ACM Transactions on Economics and Computation 8, no. 2 (2020): Article 9. Publisher's VersionAbstract

Version History:

Previously published as: Yiling Chen, Or Sheffet, and Salil Vadhan. Privacy games. In Proceedings of the 10th International Conference on Web and Internet Economics (WINE ‘14), volume 8877 of Lecture Notes in Computer Science, pages 371–385. Springer-Verlag, 14–17 December 2014. (WINE Publisher's Version linked here: https://link.springer.com/chapter/10.1007/978-3-319-13129-0_30); PDF attached as WINE2014.

The problem of analyzing the effect of privacy concerns on the behavior of selfish utility-maximizing agents has received much attention lately. Privacy concerns are often modeled by altering the utility functions of agents to consider also their privacy loss. Such privacy aware agents prefer to take a randomized strategy even in very simple games in which non-privacy aware agents play pure strategies. In some cases, the behavior of privacy aware agents follows the framework of Randomized Response, a well-known mechanism that preserves differential privacy.

Our work is aimed at better understanding the behavior of agents in settings where their privacy concerns are explicitly given. We consider a toy setting where agent A, in an attempt to discover the secret type of agent B, offers B a gift that one type of B agent likes and the other type dislikes. As opposed to previous works, B's incentive to keep her type a secret isn't the result of "hardwiring" B's utility function to consider privacy, but rather takes the form of a payment between B and A. We investigate three different types of payment functions and analyze B's behavior in each of the resulting games. As we show, under some payments, B's behavior is very different than the behavior of agents with hardwired privacy concerns and might even be deterministic. Under a different payment we show that B's BNE strategy does fall into the framework of Randomized Response.

2019
Ahmadinejad, AmirMahdi, Jonathan Kelner, Jack Murtagh, John Peebles, Aaron Sidford, and Salil Vadhan. “High-precision estimation of random walks in small space.” arXiv: 1912.04525 [cs.CC], 2019 (2019). ArXiv VersionAbstract
In this paper, we provide a deterministic $$\tilde{O}(\log N)$$-space algorithm for estimating the random walk probabilities on Eulerian directed graphs (and thus also undirected graphs) to within inverse polynomial additive error $$(ϵ = 1/\mathrm{poly}(N))$$ where $$N$$ is the length of the input. Previously, this problem was known to be solvable by a randomized algorithm using space $$O (\log N)$$ (Aleliunas et al., FOCS '79) and by a deterministic algorithm using space $$O (\log^{3/2} N)$$ (Saks and Zhou, FOCS '95 and JCSS '99), both of which held for arbitrary directed graphs but had not been improved even for undirected graphs. We also give improvements on the space complexity of both of these previous algorithms for non-Eulerian directed graphs when the error is negligible $$(ϵ=1/N^{ω(1)})$$, generalizing what Hoza and Zuckerman (FOCS '18) recently showed for the special case of distinguishing whether a random walk probability is 0 or greater than ϵ.
We achieve these results by giving new reductions between powering Eulerian random-walk matrices and inverting Eulerian Laplacian matrices, providing a new notion of spectral approximation for Eulerian graphs that is preserved under powering, and giving the first deterministic $$\tilde{O}(\log N)$$-space algorithm for inverting Eulerian Laplacian matrices. The latter algorithm builds on the work of Murtagh et al. (FOCS '17) that gave a deterministic $$\tilde{O}(\log N)$$-space algorithm for inverting undirected Laplacian matrices, and the work of Cohen et al. (FOCS '19) that gave a randomized $$\tilde{O} (N)$$-time algorithm for inverting Eulerian Laplacian matrices. A running theme throughout these contributions is an analysis of "cycle-lifted graphs," where we take a graph and "lift" it to a new graph whose adjacency matrix is the tensor product of the original adjacency matrix and a directed cycle (or variants of one).
Balcer, Victor, and Salil Vadhan. “Differential privacy on finite computers.” Journal of Privacy and Confidentiality 9, no. 2 (2019). Publisher's VersionAbstract

Version History:

Also presented at TPDP 2017; preliminary version posted as arXiv:1709.05396 [cs.DS].

2018: Published in Anna R. Karlin, editor, 9th Innovations in Theoretical Computer Science Conference (ITCS 2018), volume 94 of Leibniz International Proceedings in Informatics (LIPIcs), pp 43:1-43:21. http://drops.dagstuhl.de/opus/frontdoor.php?source_opus=8353

We consider the problem of designing and analyzing differentially private algorithms that can be implemented on discrete models of computation in strict polynomial time, motivated by known attacks on floating point implementations of real-arithmetic differentially private algorithms (Mironov, CCS 2012) and the potential for timing attacks on expected polynomial-time algorithms. As a case study, we examine the basic problem of approximating the histogram of a categorical dataset over a possibly large data universe $$X$$. The classic Laplace Mechanism (Dwork, McSherry, Nissim, Smith, TCC 2006 and J. Privacy & Confidentiality 2017) does not satisfy our requirements, as it is based on real arithmetic, and natural discrete analogues, such as the Geometric Mechanism (Ghosh, Roughgarden, Sundarajan, STOC 2009 and SICOMP 2012), take time at least linear in $$|X|$$, which can be exponential in the bit length of the input.

In this paper, we provide strict polynomial-time discrete algorithms for approximate histograms whose simultaneous accuracy (the maximum error over all bins) matches that of the Laplace Mechanism up to constant factors, while retaining the same (pure) differential privacy guarantee. One of our algorithms produces a sparse histogram as output. Its “per-bin accuracy” (the error on individual bins) is worse than that of the Laplace Mechanism by a factor of $$\log |X|$$, but we prove a lower bound showing that this is necessary for any algorithm that produces a sparse histogram. A second algorithm avoids this lower bound, and matches the per-bin accuracy of the Laplace Mechanism, by producing a compact and efficiently computable representation of a dense histogram; it is based on an $$(n + 1)$$-wise independent implementation of an appropriately clamped version of the Discrete Geometric Mechanism.

2018
Bun, Mark, Jonathan Ullman, and Salil Vadhan. “Fingerprinting codes and the price of approximate differential privacy.” SIAM Journal on Computing, Special Issue on STOC '14 47, no. 5 (2018): 1888-1938. Publisher's VersionAbstract

Version HistorySpecial Issue on STOC ‘14. Preliminary versions in STOC ‘14 and arXiv:1311.3158 [cs.CR].

We show new information-theoretic lower bounds on the sample complexity of (ε, δ)- differentially private algorithms that accurately answer large sets of counting queries. A counting query on a database $$D ∈ (\{0, 1\}^d)^n$$ has the form “What fraction of the individual records in the database satisfy the property $$q$$?” We show that in order to answer an arbitrary set $$Q$$ of $$\gg d/ \alpha^2$$ counting queries on $$D$$ to within error $$±α$$ it is necessary that $$n ≥ \tilde{Ω}(\sqrt{d} \log |Q|/α^2ε)$$. This bound is optimal up to polylogarithmic factors, as demonstrated by the private multiplicative weights algorithm (Hardt and Rothblum, FOCS’10). In particular, our lower bound is the first to show that the sample complexity required for accuracy and (ε, δ)-differential privacy is asymptotically larger than what is required merely for accuracy, which is $$O(\log |Q|/α^2 )$$. In addition, we show that our lower bound holds for the specific case of $$k$$-way marginal queries (where $$|Q| = 2^k \binom{d}{k}$$ ) when $$\alpha$$ is not too small compared to d (e.g., when $$\alpha$$ is any fixed constant). Our results rely on the existence of short fingerprinting codes (Boneh and Shaw, CRYPTO’95; Tardos, STOC’03), which we show are closely connected to the sample complexity of differentially private data release. We also give a new method for combining certain types of sample-complexity lower bounds into stronger lower bounds.

Murtagh, Jack, and Salil Vadhan. “The complexity of computing the optimal composition of differential privacy.” Theory of Computing 14 (2018): 1-35. Publisher's VersionAbstract

Version History: Full version posted on CoRR, abs/1507.03113, July 2015Additional version published in Proceedings of the 13th IACR Theory of Cryptography Conference (TCC '16-A)

In the study of differential privacy, composition theorems (starting with the original paper of Dwork, McSherry, Nissim, and Smith (TCC '06)) bound the degradation of privacy when composing several differentially private algorithms. Kairouz, Oh, and Viswanath (ICML '15) showed how to compute the optimal bound for composing $$k$$ arbitrary ($$\epsilon$$,$$\delta$$)- differentially private algorithms. We characterize the optimal composition for the more general case of $$k$$ arbitrary ($$\epsilon_1$$ , $$\delta_1$$ ), . . . , ($$\epsilon_k$$ , $$\delta_k$$ )-differentially private algorithms where the privacy parameters may differ for each algorithm in the composition. We show that computing the optimal composition in general is $$\#$$P-complete. Since computing optimal composition exactly is infeasible (unless FP$$=$$$$\#$$P), we give an approximation algorithm that computes the composition to arbitrary accuracy in polynomial time. The algorithm is a modification of Dyer’s dynamic programming approach to approximately counting solutions to knapsack problems (STOC '03).

Karwa, Vishesh, and Salil Vadhan. “Finite sample differentially private confidence intervals.” In Anna R. Karlin, editor, 9th Innovations in Theoretical Computer Science Conference (ITCS 2018), volume 94 of Leibniz International Proceedings in Informatics (LIPIcs), 44:1-44:9. Dagstuhl, Germany, 2018. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik. ITCS, 2018. Publisher's VersionAbstract

Version History: Also presented at TPDP 2017. Preliminary version posted as arXiv:1711.03908 [cs.CR].

We study the problem of estimating finite sample confidence intervals of the mean of a normal population under the constraint of differential privacy. We consider both the known and unknown variance cases and construct differentially private algorithms to estimate confidence intervals. Crucially, our algorithms guarantee a finite sample coverage, as opposed to an asymptotic coverage. Unlike most previous differentially private algorithms, we do not require the domain of the samples to be bounded. We also prove lower bounds on the expected size of any differentially private confidence set showing that our the parameters are optimal up to polylogarithmic factors.

2017
Haitner, Iftach, and Salil Vadhan. “The Many Entropies in One-way Functions.” In Tutorials on the Foundations of Cryptography, 159-217. Springer, Yehuda Lindell, ed. 2017. Publisher's VersionAbstract

Version History:

Earlier versions: May 2017: ECCC TR 17-084

Dec. 2017: ECCC TR 17-084 (revised)

Computational analogues of information-theoretic notions have given rise to some of the most interesting phenomena in the theory of computation. For example, computational indistinguishability, Goldwasser and Micali [9], which is the computational analogue of statistical distance, enabled the bypassing of Shannon’s impossibility results on perfectly secure encryption, and provided the basis for the computational theory of pseudorandomness. Pseudoentropy, Håstad, Impagliazzo, Levin, and Luby [17], a computational analogue of entropy, was the key to the fundamental result establishing the equivalence of pseudorandom generators and one-way functions, and has become a basic concept in complexity theory and cryptography.

This tutorial discusses two rather recent computational notions of entropy, both of which can be easily found in any one-way function, the most basic cryptographic primitive. The first notion is next-block pseudoentropy, Haitner, Reingold, and Vadhan [14], a refinement of pseudoentropy that enables simpler and more ecient construction of pseudorandom generators. The second is inaccessible entropy, Haitner, Reingold, Vadhan, andWee [11], which relates to unforgeability and is used to construct simpler and more efficient universal one-way hash functions and statistically hiding commitments.

Vadhan, Salil. “The Complexity of Differential Privacy.” In Tutorials on the Foundations of Cryptography, 347-450. Springer, Yehuda Lindell, ed. 2017. Publisher's VersionAbstract

Version History:

August 2016: Manuscript v1 (see files attached)

March 2017: Manuscript v2 (see files attached); Errata

April 2017: Published Version (in Tutorials on the Foundations of Cryptography; see Publisher's Version link and also SPRINGER 2017.PDF, below)

Differential privacy is a theoretical framework for ensuring the privacy of individual-level data when performing statistical analysis of privacy-sensitive datasets. This tutorial provides an introduction to and overview of differential privacy, with the goal of conveying its deep connections to a variety of other topics in computational complexity, cryptography, and theoretical computer science at large. This tutorial is written in celebration of Oded Goldreich’s 60th birthday, starting from notes taken during a minicourse given by the author and Kunal Talwar at the 26th McGill Invitational Workshop on Computational Complexity [1].

Steinke, Thomas, Salil Vadhan, and Andrew Wan. “Pseudorandomness and Fourier growth bounds for width 3 branching programs.” Theory of Computing – Special Issue on APPROX-RANDOM 2014 13, no. 12 (2017): 1-50. Publisher's VersionAbstract

Version History: a conference version of this paper appeared in the Proceedings of the 18th International Workshop on Randomization and Computation (RANDOM'14). Full version posted as ECCC TR14-076 and arXiv:1405.7028 [cs.CC].

We present an explicit pseudorandom generator for oblivious, read-once, width-3 branching programs, which can read their input bits in any order. The generator has seed length $$Õ(\log^3 n)$$.The previously best known seed length for this model is $$n^{1/2+o(1)}$$ due to Impagliazzo, Meka, and Zuckerman (FOCS ’12). Our work generalizes a recent result of Reingold, Steinke, and Vadhan (RANDOM ’13) for permutation branching programs. The main technical novelty underlying our generator is a new bound on the Fourier growth of width-3, oblivious, read-once branching programs. Specifically, we show that for any $$f : \{0, 1\}^n → \{0, 1\}$$ computed by such a branching program, and $$k ∈ [n]$$,

$$\displaystyle\sum_{s⊆[n]:|s|=k} \big| \hat{f}[s] \big | ≤n^2 ·(O(\log n))^k$$,

where $$\hat{f}[s] = \mathbb{E}_U [f[U] \cdot (-1)^{s \cdot U}]$$ is the standard Fourier transform over $$\mathbb{Z}^n_2$$. The base $$O(\log n)$$ of the Fourier growth is tight up to a factor of $$\log \log n$$.

2016
Nissim, Kobbi, Uri Stemmer, and Salil Vadhan. “Locating a small cluster privately.” In Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS ‘16), 413-427. ACM, 2016. Publisher's VersionAbstract

Version HistoryFull version posted as arXiv:1604.05590 [cs.DS].

We present a new algorithm for locating a small cluster of points with differential privacy [Dwork, McSherry, Nissim, and Smith, 2006]. Our algorithm has implications to private data exploration, clustering, and removal of outliers. Furthermore, we use it to significantly relax the requirements of the sample and aggregate technique [Nissim, Raskhodnikova, and Smith, 2007], which allows compiling of “off the shelf” (non-private) analyses into analyses that preserve differential privacy.

Gaboardi, Marco, Hyun Woo Lim, Ryan Rogers, and Salil Vadhan. “Differentially private chi-squared hypothesis testing: Goodness of fit and independence testing.” In M. Balcan and K. Weinberger, editors, Proceedings of the 33rd International Conference on Machine Learning (ICML ‘16). 2111-2120, 2016. Publisher's VersionAbstract

Version History: Preliminary version posted as arXiv:1602.03090.

Hypothesis testing is a useful statistical tool in determining whether a given model should be rejected based on a sample from the population. Sample data may contain sensitive information about individuals, such as medical information. Thus it is important to design statistical tests that guarantee the privacy of subjects in the data. In this work, we study hypothesis testing subject to differential privacy, specifically chi-squared tests for goodness of fit for multinomial data and independence between two categorical variables.

We propose new tests for goodness of fit and independence testing that like the classical versions can be used to determine whether a given model should be rejected or not, and that additionally can ensure differential privacy. We give both Monte Carlo based hypothesis tests as well as hypothesis tests that more closely follow the classical chi-squared goodness of fit test and the Pearson chi-squared test for independence. Crucially, our tests account for the distribution of the noise that is injected to ensure privacy in determining significance.

We show that these tests can be used to achieve desired significance levels, in sharp contrast to direct applications of classical tests to differentially private contingency tables which can result in wildly varying significance levels. Moreover, we study the statistical power of these tests. We empirically show that to achieve the same level of power as the classical non-private tests our new tests need only a relatively modest increase in sample size.

Gaboardi, Marco, James Honaker, Gary King, Jack Murtagh, Kobbi Nissim, Jonathan Ullman, and Salil Vadhan. “PSI (Ψ): a private data-sharing interface.” In Poster presentation at the 2nd Workshop on the Theory and Practice of Differential Privacy (TPDP ‘16), 2016. ArXiv VersionAbstract

Version History: Paper posted as arXiv:1609.04340 [cs.CR].

We provide an overview of the design of PSI (“a Private data Sharing Interface”), a system we are developing to enable researchers in the social sciences and other fields to share and explore privacy-sensitive datasets with the strong privacy protections of differential privacy.

Rogers, Ryan, Aaron Roth, Jonathan Ullman, and Salil Vadhan. “Privacy odometers and filters: Pay-as-you-go composition.” In Advances in Neural Information Processing Systems 29 (NIPS 16). 1921-1929, 2016. Publisher's VersionAbstract

Version History: Full version posted as https://arxiv.org/abs/1605.08294.

In this paper we initiate the study of adaptive composition in differential privacy when the length of the composition, and the privacy parameters themselves can be chosen adaptively, as a function of the outcome of previously run analyses. This case is much more delicate than the setting covered by existing composition theorems, in which the algorithms themselves can be chosen adaptively, but the privacy parameters must be fixed up front. Indeed, it isn’t even clear how to define differential privacy in the adaptive parameter setting. We proceed by defining two objects which cover the two main use cases of composition theorems. A privacy filter is a stopping time rule that allows an analyst to halt a computation before his pre-specified privacy budget is exceeded. A privacy odometer allows the analyst to track realized privacy loss as he goes, without needing to pre-specify a privacy budget. We show that unlike the case in which privacy parameters are fixed, in the adaptive parameter setting, these two use cases are distinct. We show that there exist privacy filters with bounds comparable (up to constants) with existing pri- vacy composition theorems. We also give a privacy odometer that nearly matches non-adaptive private composition theorems, but is sometimes worse by a small asymptotic factor. Moreover, we show that this is inherent, and that any valid privacy odometer in the adaptive parameter setting must lose this factor, which shows a formal separation between the filter and odometer use-cases.

Bun, Mark, Yi-Hsiu Chen, and Salil Vadhan. “Separating computational and statistical differential privacy in the client-server model.” In Martin Hirt and Adam D. Smith, editors, Proceedings of the 14th IACR Theory of Cryptography Conference (TCC 16-B). Lecture Notes in Computer Science. Springer Verlag, 31 October-3 November, 2016. Publisher's VersionAbstract

Version History: Full version posted on Cryptology ePrint Archive, Report 2016/820.

Differential privacy is a mathematical definition of privacy for statistical data analysis. It guarantees that any (possibly adversarial) data analyst is unable to learn too much information that is specific to an individual. Mironov et al. (CRYPTO 2009) proposed several computa- tional relaxations of differential privacy (CDP), which relax this guarantee to hold only against computationally bounded adversaries. Their work and subsequent work showed that CDP can yield substantial accuracy improvements in various multiparty privacy problems. However, these works left open whether such improvements are possible in the traditional client-server model of data analysis. In fact, Groce, Katz and Yerukhimovich (TCC 2011) showed that, in this setting, it is impossible to take advantage of CDP for many natural statistical tasks.

Our main result shows that, assuming the existence of sub-exponentially secure one-way functions and 2-message witness indistinguishable proofs (zaps) for NP, that there is in fact a computational task in the client-server model that can be efficiently performed with CDP, but is infeasible to perform with information-theoretic differential privacy.

2015
Bun, Mark, Kobbi Nissim, Uri Stemmer, and Salil Vadhan. “Differentially private release and learning of threshold functions.” In Proceedings of the 56th Annual IEEE Symposium on Foundations of Computer Science (FOCS ‘15). IEEE, 2015. Publisher's VersionAbstract

Version HistoryFull version posted as arXiv:1504.07553.

We prove new upper and lower bounds on the sample complexity of $$(\varepsilon, \delta)$$ differentially private algorithms for releasing approximate answers to threshold functions. A threshold function $$c_x$$ over a totally ordered domain $$X$$ evaluates to $$c_x(y)=1$$ if $$y \leq {x}$$, and evaluates to $$0$$ otherwise. We give the first nontrivial lower bound for releasing thresholds with $$(\varepsilon, \delta)$$ differential privacy, showing that the task is impossible over an infinite domain $$X$$, and moreover requires sample complexity $$n \geq \Omega(\log^* |X|)$$, which grows with the size of the domain. Inspired by the techniques used to prove this lower bound, we give an algorithm for releasing thresholds with $$n ≤ 2^{(1+o(1)) \log^∗|X|}$$ samples. This improves the previous best upper bound of $$8^{(1+o(1)) \log^∗ |X|}$$(Beimel et al., RANDOM ’13).

Our sample complexity upper and lower bounds also apply to the tasks of learning distri- butions with respect to Kolmogorov distance and of properly PAC learning thresholds with differential privacy. The lower bound gives the first separation between the sample complexity of properly learning a concept class with $$(\varepsilon, \delta)$$ differential privacy and learning without privacy. For properly learning thresholds in $$\ell$$ dimensions, this lower bound extends to $$n ≥ Ω(\ell·\log^∗ |X|)$$.

To obtain our results, we give reductions in both directions from releasing and properly learning thresholds and the simpler interior point problem. Given a database $$D$$ of elements from $$X$$, the interior point problem asks for an element between the smallest and largest elements in $$D$$. We introduce new recursive constructions for bounding the sample complexity of the interior point problem, as well as further reductions and techniques for proving impossibility results for other basic problems in differential privacy.

Chen, Sitan, Thomas Steinke, and Salil P. Vadhan. “Pseudorandomness for read-once, constant-depth circuits.” CoRR, 2015, 1504.04675. Publisher's VersionAbstract

For Boolean functions computed by read-once, depth-D circuits with unbounded fan-in over the de Morgan basis, we present an explicit pseudorandom generator with seed length $$\tilde{O}(\log^{D+1} n)$$. The previous best seed length known for this model was $$\tilde{O}(\log^{D+4} n)$$, obtained by Trevisan and Xue (CCC ‘13) for all of AC0 (not just read-once). Our work makes use of Fourier analytic techniques for pseudorandomness introduced by Reingold, Steinke, and Vadhan (RANDOM ‘13) to show that the generator of Gopalan et al. (FOCS ‘12) fools read-once AC0. To this end, we prove a new Fourier growth bound for read-once circuits, namely that for every $$F : \{0,1\}^n\rightarrow \{0,1\}$$ computed by a read-once, depth-$$D$$ circuit,

$$\left|\hat{F}[s]\right| \leq O\left(\log^{D-1} n\right)^k,$$

where $$\hat{F}$$ denotes the Fourier transform of $$F$$ over $$\mathbb{Z}_2^n$$.

Dwork, Cynthia, Adam Smith, Thomas Steinke, Jonathan Ullman, and Salil Vadhan. “Robust traceability from trace amounts.” In Proceedings of the 56th Annual IEEE Symposium on Foundations of Computer Science (FOCS ‘15), 650-669. IEEE, 2015. Publisher's VersionAbstract

The privacy risks inherent in the release of a large number of summary statistics were illustrated by Homer et al. (PLoS Genetics, 2008), who considered the case of 1-way marginals of SNP allele frequencies obtained in a genome-wide association study: Given a large number of minor allele frequencies from a case group of individuals diagnosed with a particular disease, together with the genomic data of a single target individual and statistics from a sizable reference dataset independently drawn from the same population, an attacker can determine with high confidence whether or not the target is in the case group.

In this work we describe and analyze a simple attack that succeeds even if the summary statistics are significantly distorted, whether due to measurement error or noise intentionally introduced to protect privacy. Our attack only requires that the vector of distorted summary statistics is close to the vector of true marginals in $$\ell_1$$norm. Moreover, the reference pool required by previous attacks can be replaced by a single sample drawn from the underlying population.

The new attack, which is not specific to genomics and which handles Gaussian as well as Bernouilli data, significantly generalizes recent lower bounds on the noise needed to ensure differential privacy (Bun, Ullman, and Vadhan, STOC 2014; Steinke and Ullman, 2015), obviating the need for the attacker to control the exact distribution of the data.