# Privacy for Social Science Research (Gift from Google, Inc.)

2020
Chen, Yiling, Or Sheffet, and Salil Vadhan. “Privacy games.” ACM Transactions on Economics and Computation 8, no. 2 (2020): Article 9. Publisher's VersionAbstract

Version History:

Previously published as: Yiling Chen, Or Sheffet, and Salil Vadhan. Privacy games. In Proceedings of the 10th International Conference on Web and Internet Economics (WINE ‘14), volume 8877 of Lecture Notes in Computer Science, pages 371–385. Springer-Verlag, 14–17 December 2014. (WINE Publisher's Version linked here: https://link.springer.com/chapter/10.1007/978-3-319-13129-0_30); PDF attached as WINE2014.

The problem of analyzing the effect of privacy concerns on the behavior of selfish utility-maximizing agents has received much attention lately. Privacy concerns are often modeled by altering the utility functions of agents to consider also their privacy loss. Such privacy aware agents prefer to take a randomized strategy even in very simple games in which non-privacy aware agents play pure strategies. In some cases, the behavior of privacy aware agents follows the framework of Randomized Response, a well-known mechanism that preserves differential privacy.

Our work is aimed at better understanding the behavior of agents in settings where their privacy concerns are explicitly given. We consider a toy setting where agent A, in an attempt to discover the secret type of agent B, offers B a gift that one type of B agent likes and the other type dislikes. As opposed to previous works, B's incentive to keep her type a secret isn't the result of "hardwiring" B's utility function to consider privacy, but rather takes the form of a payment between B and A. We investigate three different types of payment functions and analyze B's behavior in each of the resulting games. As we show, under some payments, B's behavior is very different than the behavior of agents with hardwired privacy concerns and might even be deterministic. Under a different payment we show that B's BNE strategy does fall into the framework of Randomized Response.

2018
Bun, Mark, Jonathan Ullman, and Salil Vadhan. “Fingerprinting codes and the price of approximate differential privacy.” SIAM Journal on Computing, Special Issue on STOC '14 47, no. 5 (2018): 1888-1938. Publisher's VersionAbstract

Version HistorySpecial Issue on STOC ‘14. Preliminary versions in STOC ‘14 and arXiv:1311.3158 [cs.CR].

We show new information-theoretic lower bounds on the sample complexity of (ε, δ)- differentially private algorithms that accurately answer large sets of counting queries. A counting query on a database $$D ∈ (\{0, 1\}^d)^n$$ has the form “What fraction of the individual records in the database satisfy the property $$q$$?” We show that in order to answer an arbitrary set $$Q$$ of $$\gg d/ \alpha^2$$ counting queries on $$D$$ to within error $$±α$$ it is necessary that $$n ≥ \tilde{Ω}(\sqrt{d} \log |Q|/α^2ε)$$. This bound is optimal up to polylogarithmic factors, as demonstrated by the private multiplicative weights algorithm (Hardt and Rothblum, FOCS’10). In particular, our lower bound is the first to show that the sample complexity required for accuracy and (ε, δ)-differential privacy is asymptotically larger than what is required merely for accuracy, which is $$O(\log |Q|/α^2 )$$. In addition, we show that our lower bound holds for the specific case of $$k$$-way marginal queries (where $$|Q| = 2^k \binom{d}{k}$$ ) when $$\alpha$$ is not too small compared to d (e.g., when $$\alpha$$ is any fixed constant). Our results rely on the existence of short fingerprinting codes (Boneh and Shaw, CRYPTO’95; Tardos, STOC’03), which we show are closely connected to the sample complexity of differentially private data release. We also give a new method for combining certain types of sample-complexity lower bounds into stronger lower bounds.

2016
Chen, Yiling, Stephen Chong, Ian A. Kash, Tal Moran, and Salil P. Vadhan. “Truthful mechanisms for agents that value privacy.” ACM Transactions on Economics and Computation 4, no. 3 (2016). Publisher's VersionAbstract

Version History: Special issue on EC ‘13. Preliminary version at arXiv:1111.5472 [cs.GT] (Nov. 2011).

Recent work has constructed economic mechanisms that are both truthful and differentially private. In these mechanisms, privacy is treated separately from truthfulness; it is not incorporated in players’ utility functions (and doing so has been shown to lead to nontruthfulness in some cases). In this work, we propose a new, general way of modeling privacy in players’ utility functions. Specifically, we only assume that if an outcome $${o}$$ has the property that any report of player $${i}$$ would have led to $${o}$$ with approximately the same probability, then $${o}$$ has a small privacy cost to player $${i}$$. We give three mechanisms that are truthful with respect to our modeling of privacy: for an election between two candidates, for a discrete version of the facility location problem, and for a general social choice problem with discrete utilities (via a VCG-like mechanism). As the number $${n}$$ of players increases, the social welfare achieved by our mechanisms approaches optimal (as a fraction of $${n}$$).

2015
Bun, Mark, Kobbi Nissim, Uri Stemmer, and Salil Vadhan. “Differentially private release and learning of threshold functions.” In Proceedings of the 56th Annual IEEE Symposium on Foundations of Computer Science (FOCS ‘15). IEEE, 2015. Publisher's VersionAbstract

Version HistoryFull version posted as arXiv:1504.07553.

We prove new upper and lower bounds on the sample complexity of $$(\varepsilon, \delta)$$ differentially private algorithms for releasing approximate answers to threshold functions. A threshold function $$c_x$$ over a totally ordered domain $$X$$ evaluates to $$c_x(y)=1$$ if $$y \leq {x}$$, and evaluates to $$0$$ otherwise. We give the first nontrivial lower bound for releasing thresholds with $$(\varepsilon, \delta)$$ differential privacy, showing that the task is impossible over an infinite domain $$X$$, and moreover requires sample complexity $$n \geq \Omega(\log^* |X|)$$, which grows with the size of the domain. Inspired by the techniques used to prove this lower bound, we give an algorithm for releasing thresholds with $$n ≤ 2^{(1+o(1)) \log^∗|X|}$$ samples. This improves the previous best upper bound of $$8^{(1+o(1)) \log^∗ |X|}$$(Beimel et al., RANDOM ’13).

Our sample complexity upper and lower bounds also apply to the tasks of learning distri- butions with respect to Kolmogorov distance and of properly PAC learning thresholds with differential privacy. The lower bound gives the first separation between the sample complexity of properly learning a concept class with $$(\varepsilon, \delta)$$ differential privacy and learning without privacy. For properly learning thresholds in $$\ell$$ dimensions, this lower bound extends to $$n ≥ Ω(\ell·\log^∗ |X|)$$.

To obtain our results, we give reductions in both directions from releasing and properly learning thresholds and the simpler interior point problem. Given a database $$D$$ of elements from $$X$$, the interior point problem asks for an element between the smallest and largest elements in $$D$$. We introduce new recursive constructions for bounding the sample complexity of the interior point problem, as well as further reductions and techniques for proving impossibility results for other basic problems in differential privacy.

Dwork, Cynthia, Adam Smith, Thomas Steinke, Jonathan Ullman, and Salil Vadhan. “Robust traceability from trace amounts.” In Proceedings of the 56th Annual IEEE Symposium on Foundations of Computer Science (FOCS ‘15), 650-669. IEEE, 2015. Publisher's VersionAbstract

The privacy risks inherent in the release of a large number of summary statistics were illustrated by Homer et al. (PLoS Genetics, 2008), who considered the case of 1-way marginals of SNP allele frequencies obtained in a genome-wide association study: Given a large number of minor allele frequencies from a case group of individuals diagnosed with a particular disease, together with the genomic data of a single target individual and statistics from a sizable reference dataset independently drawn from the same population, an attacker can determine with high confidence whether or not the target is in the case group.

In this work we describe and analyze a simple attack that succeeds even if the summary statistics are significantly distorted, whether due to measurement error or noise intentionally introduced to protect privacy. Our attack only requires that the vector of distorted summary statistics is close to the vector of true marginals in $$\ell_1$$norm. Moreover, the reference pool required by previous attacks can be replaced by a single sample drawn from the underlying population.

The new attack, which is not specific to genomics and which handles Gaussian as well as Bernouilli data, significantly generalizes recent lower bounds on the noise needed to ensure differential privacy (Bun, Ullman, and Vadhan, STOC 2014; Steinke and Ullman, 2015), obviating the need for the attacker to control the exact distribution of the data.

2012
Thaler, Justin, Jonathan Ullman, and Salil Vadhan. “Faster algorithms for privately releasing marginals.” In Artur Czumaj, Kurt Mehlhorn, Andrew M. Pitts, and Roger Wattenhofer, editors, Proceedings of the 39th International Colloquium on Automata, Languages, and Programming (ICALP ‘12), Lecture Notes on Computer Science, 7391:810-821. Springer-Verlag, 2012. Publisher's VersionAbstract

Version History: Full version posted as arXiv:1205.1758v2.

We study the problem of releasing k-way marginals of a database $$D ∈ ({0,1}^d)^n$$, while preserving differential privacy. The answer to a $$k$$-way marginal query is the fraction of $$D$$’s records $$x \in \{0,1\}^d$$ with a given value in each of a given set of up to $$k$$ columns. Marginal queries enable a rich class of statistical analyses of a dataset, and designing efficient algorithms for privately releasing marginal queries has been identified as an important open problem in private data analysis (cf. Barak et. al., PODS ’07).

We give an algorithm that runs in time $$d^{O(\sqrt{K})}$$ and releases a private summary capable of answering any $$k$$-way marginal query with at most ±.01 error on every query as long as $$n \geq d^{O(\sqrt{K})}$$. To our knowledge, ours is the first algorithm capable of privately releasing marginal queries with non-trivial worst-case accuracy guarantees in time substantially smaller than the number of k-way marginal queries, which is $$d^{\Theta(k)}$$ (for $$k ≪ d$$).

Dodis, Yevgeniy, Adriana López-Alt, Ilya Mironov, and Salil Vadhan. “Differential privacy with imperfect randomness.” In Ran Canetti and Rei Safavi-Naini, editors, Proceedings of the 32nd International Cryptology Conference (CRYPTO ‘12), Lecture Notes on Computer Science, 7417:497-516. Springer-Verlag, 2012. Publisher's VersionAbstract

In this work we revisit the question of basing cryptography on imperfect randomness. Bosley and Dodis (TCC’07) showed that if a source of randomness $$\mathcal{R}$$ is “good enough” to generate a secret key capable of encrypting $$k$$ bits, then one can deterministically extract nearly $$k$$ almost uniform bits from $$\mathcal{R}$$, suggesting that traditional privacy notions (namely, indistinguishability of encryption) requires an “extractable” source of randomness. Other, even stronger impossibility results are known for achieving privacy under specific “non-extractable” sources of randomness, such as the $$\gamma$$-Santha-Vazirani (SV) source, where each next bit has fresh entropy, but is allowed to have a small bias $$\gamma < 1$$ (possibly depending on prior bits).

We ask whether similar negative results also hold for a more recent notion of privacy called differential privacy (Dwork et al., TCC’06), concentrating, in particular, on achieving differential privacy with the Santha-Vazirani source. We show that the answer is no. Specifically, we give a differentially private mechanism for approximating arbitrary “low sensitivity” functions that works even with randomness coming from a $$\gamma$$-Santha-Vazirani source, for any $$\gamma < 1$$. This provides a somewhat surprising “separation” between traditional privacy and differential privacy with respect to imperfect randomness.

Interestingly, the design of our mechanism is quite different from the traditional “additive-noise” mechanisms (e.g., Laplace mechanism) successfully utilized to achieve differential privacy with perfect randomness. Indeed, we show that any (non-trivial) “SV-robust” mechanism for our problem requires a demanding property called consistent sampling, which is strictly stronger than differential privacy, and cannot be satisfied by any additive-noise mechanism.

Dwork, Cynthia, Moni Naor, and Salil Vadhan. “The privacy of the analyst and the power of the state.” In Proceedings of the 53rd Annual IEEE Symposium on Foundations of Computer Science (FOCS ‘12), 400-409. IEEE, 2012. Publisher's VersionAbstract
We initiate the study of privacy for the analyst in differentially private data analysis. That is, not only will we be concerned with ensuring differential privacy for the data (i.e. individuals or customers), which are the usual concern of differential privacy, but we also consider (differential) privacy for the set of queries posed by each data analyst. The goal is to achieve privacy with respect to other analysts, or users of the system. This problem arises only in the context of stateful privacy mechanisms, in which the responses to queries depend on other queries posed (a recent wave of results in the area utilized cleverly coordinated noise and state in order to allow answering privately hugely many queries). We argue that the problem is real by proving an exponential gap between the number of queries that can be answered (with non-trivial error) by stateless and stateful differentially private mechanisms. We then give a stateful algorithm for differentially private data analysis that also ensures differential privacy for the analyst and can answer exponentially many queries.