# Statistics & ML

Alabi, Daniel, Audra McMillan, Jayshree Sarathy, Adam Smith, and Salil Vadhan. “Differentially private simple linear regression.” arXiv: 2007.05157 [cs.LG] (2020). Publisher's VersionAbstract
Economics and social science research often require analyzing datasets of sensitive personal information at fine granularity, with models fit to small subsets of the data. Unfortunately, such fine-grained analysis can easily reveal sensitive individual information. We study algorithms for simple linear regression that satisfy differential privacy, a constraint which guarantees that an algorithm's output reveals little about any individual input data record, even to an attacker with arbitrary side information about the dataset. We consider the design of differentially private algorithms for simple linear regression for small datasets, with tens to hundreds of datapoints, which is a particularly challenging regime for differential privacy. Focusing on a particular application to small-area analysis in economics research, we study the performance of a spectrum of algorithms we adapt to the setting. We identify key factors that affect their performance, showing through a range of experiments that algorithms based on robust estimators (in particular, the Theil-Sen estimator) perform well on the smallest datasets, but that other more standard algorithms do better as the dataset size increases.
Bun, Mark, Jonathan Ullman, and Salil Vadhan. “Fingerprinting codes and the price of approximate differential privacy.” SIAM Journal on Computing, Special Issue on STOC '14 47, no. 5 (2018): 1888-1938. Publisher's VersionAbstract

Version HistorySpecial Issue on STOC ‘14. Preliminary versions in STOC ‘14 and arXiv:1311.3158 [cs.CR].

We show new information-theoretic lower bounds on the sample complexity of (ε, δ)- differentially private algorithms that accurately answer large sets of counting queries. A counting query on a database $$D ∈ (\{0, 1\}^d)^n$$ has the form “What fraction of the individual records in the database satisfy the property $$q$$?” We show that in order to answer an arbitrary set $$Q$$ of $$\gg d/ \alpha^2$$ counting queries on $$D$$ to within error $$±α$$ it is necessary that $$n ≥ \tilde{Ω}(\sqrt{d} \log |Q|/α^2ε)$$. This bound is optimal up to polylogarithmic factors, as demonstrated by the private multiplicative weights algorithm (Hardt and Rothblum, FOCS’10). In particular, our lower bound is the first to show that the sample complexity required for accuracy and (ε, δ)-differential privacy is asymptotically larger than what is required merely for accuracy, which is $$O(\log |Q|/α^2 )$$. In addition, we show that our lower bound holds for the specific case of $$k$$-way marginal queries (where $$|Q| = 2^k \binom{d}{k}$$ ) when $$\alpha$$ is not too small compared to d (e.g., when $$\alpha$$ is any fixed constant). Our results rely on the existence of short fingerprinting codes (Boneh and Shaw, CRYPTO’95; Tardos, STOC’03), which we show are closely connected to the sample complexity of differentially private data release. We also give a new method for combining certain types of sample-complexity lower bounds into stronger lower bounds.

Karwa, Vishesh, and Salil Vadhan. “Finite sample differentially private confidence intervals.” In Anna R. Karlin, editor, 9th Innovations in Theoretical Computer Science Conference (ITCS 2018), volume 94 of Leibniz International Proceedings in Informatics (LIPIcs), 44:1-44:9. Dagstuhl, Germany, 2018. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik. ITCS, 2018. Publisher's VersionAbstract

Version History: Also presented at TPDP 2017. Preliminary version posted as arXiv:1711.03908 [cs.CR].

We study the problem of estimating finite sample confidence intervals of the mean of a normal population under the constraint of differential privacy. We consider both the known and unknown variance cases and construct differentially private algorithms to estimate confidence intervals. Crucially, our algorithms guarantee a finite sample coverage, as opposed to an asymptotic coverage. Unlike most previous differentially private algorithms, we do not require the domain of the samples to be bounded. We also prove lower bounds on the expected size of any differentially private confidence set showing that our the parameters are optimal up to polylogarithmic factors.

Vadhan, Salil. “The Complexity of Differential Privacy.” In Tutorials on the Foundations of Cryptography, 347-450. Springer, Yehuda Lindell, ed. 2017. Publisher's VersionAbstract

Version History:

August 2016: Manuscript v1 (see files attached)

March 2017: Manuscript v2 (see files attached); Errata

April 2017: Published Version (in Tutorials on the Foundations of Cryptography; see Publisher's Version link and also SPRINGER 2017.PDF, below)

Differential privacy is a theoretical framework for ensuring the privacy of individual-level data when performing statistical analysis of privacy-sensitive datasets. This tutorial provides an introduction to and overview of differential privacy, with the goal of conveying its deep connections to a variety of other topics in computational complexity, cryptography, and theoretical computer science at large. This tutorial is written in celebration of Oded Goldreich’s 60th birthday, starting from notes taken during a minicourse given by the author and Kunal Talwar at the 26th McGill Invitational Workshop on Computational Complexity [1].

Vadhan., Salil P.On learning vs. refutation.” 30th Conference on Learning Theory (COLT `17), 2017, 65, 1835-1848. Publisher's VersionAbstract
Building on the work of Daniely et al. (STOC 2014, COLT 2016), we study the connection between computationally efficient PAC learning and refutation of constraint satisfaction problems. Specifically, we prove that for every concept class $$\mathcal{P }$$ , PAC-learning $$\mathcal{P}$$ is polynomially equivalent to “random-right-hand-side-refuting” (“RRHS-refuting”) a dual class $$\mathcal{P}^∗$$, where RRHS-refutation of a class $$Q$$ refers to refuting systems of equations where the constraints are (worst-case) functions from the class $$Q$$ but the right-hand-sides of the equations are uniform and independent random bits. The reduction from refutation to PAC learning can be viewed as an abstraction of (part of) the work of Daniely, Linial, and Shalev-Schwartz (STOC 2014). The converse, however, is new, and is based on a combination of techniques from pseudorandomness (Yao ‘82) with boosting (Schapire ‘90). In addition, we show that PAC-learning the class of $$DNF$$ formulas is polynomially equivalent to PAC-learning its dual class $$DNF ^∗$$ , and thus PAC-learning $$DNF$$ is equivalent to RRHS-refutation of $$DNF$$ , suggesting an avenue to obtain stronger lower bounds for PAC-learning $$DNF$$ than the quasipolynomial lower bound that was obtained by Daniely and Shalev-Schwartz (COLT 2016) assuming the hardness of refuting $$k$$-SAT.
Nissim, Kobbi, Uri Stemmer, and Salil Vadhan. “Locating a small cluster privately.” In Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS ‘16), 413-427. ACM, 2016. Publisher's VersionAbstract

Version HistoryFull version posted as arXiv:1604.05590 [cs.DS].

We present a new algorithm for locating a small cluster of points with differential privacy [Dwork, McSherry, Nissim, and Smith, 2006]. Our algorithm has implications to private data exploration, clustering, and removal of outliers. Furthermore, we use it to significantly relax the requirements of the sample and aggregate technique [Nissim, Raskhodnikova, and Smith, 2007], which allows compiling of “off the shelf” (non-private) analyses into analyses that preserve differential privacy.

Gaboardi, Marco, Hyun Woo Lim, Ryan Rogers, and Salil Vadhan. “Differentially private chi-squared hypothesis testing: Goodness of fit and independence testing.” In M. Balcan and K. Weinberger, editors, Proceedings of the 33rd International Conference on Machine Learning (ICML ‘16). 2111-2120, 2016. Publisher's VersionAbstract

Version History: Preliminary version posted as arXiv:1602.03090.

Hypothesis testing is a useful statistical tool in determining whether a given model should be rejected based on a sample from the population. Sample data may contain sensitive information about individuals, such as medical information. Thus it is important to design statistical tests that guarantee the privacy of subjects in the data. In this work, we study hypothesis testing subject to differential privacy, specifically chi-squared tests for goodness of fit for multinomial data and independence between two categorical variables.

We propose new tests for goodness of fit and independence testing that like the classical versions can be used to determine whether a given model should be rejected or not, and that additionally can ensure differential privacy. We give both Monte Carlo based hypothesis tests as well as hypothesis tests that more closely follow the classical chi-squared goodness of fit test and the Pearson chi-squared test for independence. Crucially, our tests account for the distribution of the noise that is injected to ensure privacy in determining significance.

We show that these tests can be used to achieve desired significance levels, in sharp contrast to direct applications of classical tests to differentially private contingency tables which can result in wildly varying significance levels. Moreover, we study the statistical power of these tests. We empirically show that to achieve the same level of power as the classical non-private tests our new tests need only a relatively modest increase in sample size.

Bun, Mark, Kobbi Nissim, Uri Stemmer, and Salil Vadhan. “Differentially private release and learning of threshold functions.” In Proceedings of the 56th Annual IEEE Symposium on Foundations of Computer Science (FOCS ‘15). IEEE, 2015. Publisher's VersionAbstract

Version HistoryFull version posted as arXiv:1504.07553.

We prove new upper and lower bounds on the sample complexity of $$(\varepsilon, \delta)$$ differentially private algorithms for releasing approximate answers to threshold functions. A threshold function $$c_x$$ over a totally ordered domain $$X$$ evaluates to $$c_x(y)=1$$ if $$y \leq {x}$$, and evaluates to $$0$$ otherwise. We give the first nontrivial lower bound for releasing thresholds with $$(\varepsilon, \delta)$$ differential privacy, showing that the task is impossible over an infinite domain $$X$$, and moreover requires sample complexity $$n \geq \Omega(\log^* |X|)$$, which grows with the size of the domain. Inspired by the techniques used to prove this lower bound, we give an algorithm for releasing thresholds with $$n ≤ 2^{(1+o(1)) \log^∗|X|}$$ samples. This improves the previous best upper bound of $$8^{(1+o(1)) \log^∗ |X|}$$(Beimel et al., RANDOM ’13).

Our sample complexity upper and lower bounds also apply to the tasks of learning distri- butions with respect to Kolmogorov distance and of properly PAC learning thresholds with differential privacy. The lower bound gives the first separation between the sample complexity of properly learning a concept class with $$(\varepsilon, \delta)$$ differential privacy and learning without privacy. For properly learning thresholds in $$\ell$$ dimensions, this lower bound extends to $$n ≥ Ω(\ell·\log^∗ |X|)$$.

To obtain our results, we give reductions in both directions from releasing and properly learning thresholds and the simpler interior point problem. Given a database $$D$$ of elements from $$X$$, the interior point problem asks for an element between the smallest and largest elements in $$D$$. We introduce new recursive constructions for bounding the sample complexity of the interior point problem, as well as further reductions and techniques for proving impossibility results for other basic problems in differential privacy.

Dwork, Cynthia, Adam Smith, Thomas Steinke, Jonathan Ullman, and Salil Vadhan. “Robust traceability from trace amounts.” In Proceedings of the 56th Annual IEEE Symposium on Foundations of Computer Science (FOCS ‘15), 650-669. IEEE, 2015. Publisher's VersionAbstract

The privacy risks inherent in the release of a large number of summary statistics were illustrated by Homer et al. (PLoS Genetics, 2008), who considered the case of 1-way marginals of SNP allele frequencies obtained in a genome-wide association study: Given a large number of minor allele frequencies from a case group of individuals diagnosed with a particular disease, together with the genomic data of a single target individual and statistics from a sizable reference dataset independently drawn from the same population, an attacker can determine with high confidence whether or not the target is in the case group.

In this work we describe and analyze a simple attack that succeeds even if the summary statistics are significantly distorted, whether due to measurement error or noise intentionally introduced to protect privacy. Our attack only requires that the vector of distorted summary statistics is close to the vector of true marginals in $$\ell_1$$norm. Moreover, the reference pool required by previous attacks can be replaced by a single sample drawn from the underlying population.

The new attack, which is not specific to genomics and which handles Gaussian as well as Bernouilli data, significantly generalizes recent lower bounds on the noise needed to ensure differential privacy (Bun, Ullman, and Vadhan, STOC 2014; Steinke and Ullman, 2015), obviating the need for the attacker to control the exact distribution of the data.

Thaler, Justin, Jonathan Ullman, and Salil Vadhan. “Faster algorithms for privately releasing marginals.” In Artur Czumaj, Kurt Mehlhorn, Andrew M. Pitts, and Roger Wattenhofer, editors, Proceedings of the 39th International Colloquium on Automata, Languages, and Programming (ICALP ‘12), Lecture Notes on Computer Science, 7391:810-821. Springer-Verlag, 2012. Publisher's VersionAbstract

Version History: Full version posted as arXiv:1205.1758v2.

We study the problem of releasing k-way marginals of a database $$D ∈ ({0,1}^d)^n$$, while preserving differential privacy. The answer to a $$k$$-way marginal query is the fraction of $$D$$’s records $$x \in \{0,1\}^d$$ with a given value in each of a given set of up to $$k$$ columns. Marginal queries enable a rich class of statistical analyses of a dataset, and designing efficient algorithms for privately releasing marginal queries has been identified as an important open problem in private data analysis (cf. Barak et. al., PODS ’07).

We give an algorithm that runs in time $$d^{O(\sqrt{K})}$$ and releases a private summary capable of answering any $$k$$-way marginal query with at most ±.01 error on every query as long as $$n \geq d^{O(\sqrt{K})}$$. To our knowledge, ours is the first algorithm capable of privately releasing marginal queries with non-trivial worst-case accuracy guarantees in time substantially smaller than the number of k-way marginal queries, which is $$d^{\Theta(k)}$$ (for $$k ≪ d$$).

Dwork, Cynthia, Guy Rothblum, and Salil Vadhan. “Boosting and differential privacy.” In Proceedings of the 51st Annual IEEE Symposium on Foundations of Computer Science (FOCS ‘10), 51-60. IEEE, 2010. Publisher's VersionAbstract
Boosting is a general method for improving the accuracy of learning algorithms. We use boosting to construct improved privacy-preserving synopses of an input database. These are data structures that yield, for a given set $$\mathcal{Q}$$ of queries over an input database, reasonably accurate estimates of the responses to every query in $$\mathcal{Q}$$, even when the number of queries is much larger than the number of rows in the database. Given a base synopsis generator that takes a distribution on $$\mathcal{Q}$$ and produces a "weak" synopsis that yields "good" answers for a majority of the weight in $$\mathcal{Q}$$, our Boosting for Queries algorithm obtains a synopsis that is good for all of $$\mathcal{Q}$$. We ensure privacy for the rows of the database, but the boosting is performed on the queries. We also provide the first synopsis generators for arbitrary sets of arbitrary low-sensitivity queries, i.e., queries whose answers do not vary much under the addition or deletion of a single row. In the execution of our algorithm certain tasks, each incurring some privacy loss, are performed many times. To analyze the cumulative privacy loss, we obtain an $$O(\varepsilon^2)$$ bound on the expected privacy loss from a single $$\varepsilon$$-differentially private mechanism. Combining this with evolution of confidence arguments from the literature, we get stronger bounds on the expected cumulative privacy loss due to multiple mechanisms, each of which provides $$\varepsilon$$-differential privacy or one of its relaxations, and each of which operates on (potentially) different, adaptively chosen, databases.
Trevisan, Luca, Madhur Tulsiani, and Salil Vadhan. “Regularity, boosting, and efficiently simulating every high-entropy distribution.” In Proceedings of the 24th Annual IEEE Conference on Computational Complexity (CCC ‘09), 126-136. IEEE, 2009. Publisher's VersionAbstract

Version HistoryPreliminary version posted as ECCC TR08-103.

We show that every bounded function $${g}$$ : $$\{0,1\}^n \to [0,1]$$ admits an efficiently computable "simulator" function $${h}$$ : $$\{0,1\}^n \to [0,1]$$ such that every fixed polynomial size circuit has approximately the same correlation with $${g}$$ as with $${h}$$. If g describes (up to scaling) a high min-entropy distribution $$D$$, then $${h}$$ can be used to efficiently sample a distribution $$D'$$ of the same min-entropy that is indistinguishable from $$D$$ by circuits of fixed polynomial size. We state and prove our result in a more abstract setting, in which we allow arbitrary finite domains instead of $$\{0,1\}^n$$, and arbitrary families of distinguishers, instead of fixed polynomial size circuits. Our result implies (a) the weak Szemeredi regularity Lemma of Frieze and Kannan (b) a constructive version of the dense model theorem of Green, Tao and Ziegler with better quantitative parameters (polynomial rather than exponential in the distinguishing probability), and (c) the Impagliazzo hardcore set Lemma. It appears to be the general result underlying the known connections between "regularity" results in graph theory, "decomposition" results in additive combinatorics, and the hardcore Lemma in complexity theory. We present two proofs of our result, one in the spirit of Nisan's proof of the hardcore Lemma via duality of linear programming, and one similar to Impagliazzo's "boosting" proof. A third proof by iterative partitioning, which gives the complexity of the sampler to be exponential in the distinguishing probability, is also implicit in the Green-Tao-Ziegler proofs of the dense model theorem.

Bogdanov, Andrej, Elchanan Mossel, and Salil Vadhan. “The complexity of distinguishing Markov random fields.” In Proceedings of the 12th International Workshop on Randomization and Computation (RANDOM ‘08), Lecture Notes in Computer Science, 5171:331-342. Springer-Verlag, 2008. Publisher's VersionAbstract

Markov random fields are often used to model high dimensional distributions in a number of applied areas. A number of recent papers have studied the problem of reconstructing a dependency graph of bounded degree from independent samples from the Markov random field. These results require observing samples of the distribution at all nodes of the graph. It was heuristically recognized that the problem of reconstructing the model where there are hidden variables (some of the variables are not observed) is much harder.

Here we prove that the problem of reconstructing bounded-degree models with hidden nodes is hard. Specifically, we show that unless NP = RP,

• It is impossible to decide in randomized polynomial time if two mod- els generate distributions whose statistical distance is at most 1/3 or at least 2/3.
• Given two generating models whose statistical distance is promised to be at least 1/3, and oracle access to independent samples from one of the models, it is impossible to decide in randomized polynomial time which of the two samples is consistent with the model.

The second problem remains hard even if the samples are generated efficiently, albeit under a stronger assumption.

Ron, Dana, Amir Rosenfeld, and Salil Vadhan. “The hardness of the expected decision depth problem.” Information Processing Letters 101, no. 3 (2007): 112-118. Publisher's VersionAbstract

Given a function $$f$$ over $$n$$ binary variables, and an ordering of the $$n$$ variables, we consider the Expected Decision Depth problem. Namely, what is the expected number of bits that need to be observed until the value of the function is determined, when bits of the input are observed according to the given order. Our main finding is that this problem is (essentially) #P-complete. Moreover, the hardness holds even when the function f is represented as a decision tree.