Privacy

2015
O'Brien, David, Jonathan Ullman, Micah Altman, Urs Gasser, Michael Bar-Sinai, Kobbi Nissim, Salil Vadhan, Michael Wojcik, and Alexandra Wood. “Integrating approaches to privacy across the research lifecycle: When is information purely public?” Berkman Center Research Publication No. 2015-7, March 2015. Publisher's Version

Version History: Available at SSRN: http://ssrn.com/abstract=2586158.

On September 24-25, 2013, the Privacy Tools for Sharing Research Data project at Harvard University held a workshop titled "Integrating Approaches to Privacy across the Research Data Lifecycle." Over forty leading experts in computer science, statistics, law, policy, and social science research convened to discuss the state of the art in data privacy research. The resulting conversations centered on the emerging tools and approaches from the participants’ various disciplines and how they should be integrated in the context of real-world use cases that involve the management of confidential research data. 

Researchers are increasingly obtaining data from social networking websites, publicly-placed sensors, government records and other public sources. Much of this information appears public, at least to first impressions, and it is capable of being used in research for a wide variety of purposes with seemingly minimal legal restrictions. The insights about human behaviors we may gain from research that uses this data are promising. However, members of the research community are questioning the ethics of these practices, and at the heart of the matter are some difficult questions about the boundaries between public and private information. This workshop report, the second in a series, identifies selected questions and explores issues around the meaning of “public” in the context of using data about individuals for research purposes. 

SSRN2015.pdf
Bun, Mark, Kobbi Nissim, Uri Stemmer, and Salil Vadhan. “Differentially private release and learning of threshold functions.” In Proceedings of the 56th Annual IEEE Symposium on Foundations of Computer Science (FOCS ‘15). IEEE, 2015. Publisher's Version

Version History: Full version posted as arXiv:1504.07553.

We prove new upper and lower bounds on the sample complexity of \((\varepsilon, \delta)\) differentially private algorithms for releasing approximate answers to threshold functions. A threshold function \(c_x\) over a totally ordered domain \(X\) evaluates to \(c_x(y)=1\) if \(y \leq x\), and evaluates to \(0\) otherwise. We give the first nontrivial lower bound for releasing thresholds with \((\varepsilon, \delta)\) differential privacy, showing that the task is impossible over an infinite domain \(X\), and moreover requires sample complexity \(n \geq \Omega(\log^* |X|)\), which grows with the size of the domain. Inspired by the techniques used to prove this lower bound, we give an algorithm for releasing thresholds with \(n \leq 2^{(1+o(1)) \log^* |X|}\) samples. This improves the previous best upper bound of \(8^{(1+o(1)) \log^* |X|}\) (Beimel et al., RANDOM ’13).
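
For intuition about how slowly the \(\log^* |X|\) bounds above grow, here is a small Python helper (ours, not from the paper) computing the iterated logarithm, i.e., the number of times \(\log_2\) must be applied before the value drops to at most 1.

import math

def log_star(x):
    """Iterated logarithm: number of times log2 is applied until x <= 1."""
    count = 0
    while x > 1:
        x = math.log2(x)
        count += 1
    return count

# Even for an astronomically large domain, log* stays tiny:
print(log_star(2 ** 65536))  # -> 5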

Our sample complexity upper and lower bounds also apply to the tasks of learning distributions with respect to Kolmogorov distance and of properly PAC learning thresholds with differential privacy. The lower bound gives the first separation between the sample complexity of properly learning a concept class with \((\varepsilon, \delta)\) differential privacy and learning without privacy. For properly learning thresholds in \(\ell\) dimensions, this lower bound extends to \(n \geq \Omega(\ell \cdot \log^* |X|)\).

To obtain our results, we give reductions in both directions between releasing and properly learning thresholds and the simpler interior point problem. Given a database \(D\) of elements from \(X\), the interior point problem asks for an element between the smallest and largest elements in \(D\). We introduce new recursive constructions for bounding the sample complexity of the interior point problem, as well as further reductions and techniques for proving impossibility results for other basic problems in differential privacy.
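
To make the interior point problem concrete, here is a tiny non-private illustration (our sketch, not the paper's construction): any element of the database, e.g., its median, is already a valid answer; the paper's challenge is to return such a point while satisfying \((\varepsilon, \delta)\)-differential privacy with as few samples as possible.

def interior_point_nonprivate(db):
    """Return some x with min(db) <= x <= max(db).
    Any element of db works; we return the median for concreteness.
    This ignores privacy entirely -- the paper studies how many samples
    are needed to solve the same problem under differential privacy."""
    return sorted(db)[len(db) // 2]

print(interior_point_nonprivate([3, 9, 4, 12, 7]))  # -> 7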

ArXiv2015.pdf IEEE2015.pdf
Dwork, Cynthia, Adam Smith, Thomas Steinke, Jonathan Ullman, and Salil Vadhan. “Robust traceability from trace amounts.” In Proceedings of the 56th Annual IEEE Symposium on Foundations of Computer Science (FOCS ‘15), 650-669. IEEE, 2015. Publisher's Version

The privacy risks inherent in the release of a large number of summary statistics were illustrated by Homer et al. (PLoS Genetics, 2008), who considered the case of 1-way marginals of SNP allele frequencies obtained in a genome-wide association study: Given a large number of minor allele frequencies from a case group of individuals diagnosed with a particular disease, together with the genomic data of a single target individual and statistics from a sizable reference dataset independently drawn from the same population, an attacker can determine with high confidence whether or not the target is in the case group.

In this work we describe and analyze a simple attack that succeeds even if the summary statistics are significantly distorted, whether due to measurement error or noise intentionally introduced to protect privacy. Our attack only requires that the vector of distorted summary statistics is close to the vector of true marginals in \(\ell_1\) norm. Moreover, the reference pool required by previous attacks can be replaced by a single sample drawn from the underlying population.

The new attack, which is not specific to genomics and which handles Gaussian as well as Bernoulli data, significantly generalizes recent lower bounds on the noise needed to ensure differential privacy (Bun, Ullman, and Vadhan, STOC 2014; Steinke and Ullman, 2015), obviating the need for the attacker to control the exact distribution of the data.
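
To convey the flavor of such tracing attacks, here is a schematic Python sketch of an inner-product style membership test (our simplification; the statistic and the choice of threshold are not the paper's): the target's attribute vector is scored against the released statistics relative to a single fresh reference sample, and a large score is taken as evidence that the target is in the case group.

import numpy as np

def tracing_score(target, reference, released_stats):
    """Schematic inner-product test statistic.
    target, reference: attribute vectors for the target individual and one
    fresh sample from the population; released_stats: the (possibly noisy)
    attribute frequencies reported for the case group."""
    return float(np.dot(np.asarray(target) - np.asarray(reference), released_stats))

def claims_membership(target, reference, released_stats, threshold):
    # A large score suggests the target's attributes track the case-group
    # statistics more closely than an independent sample's do.
    return tracing_score(target, reference, released_stats) > threshold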

FOCS2015.pdf
2014
Nissim, Kobbi, Salil Vadhan, and David Xiao. “Redrawing the boundaries on purchasing data from privacy-sensitive individuals.” In Moni Naor, editor, Innovations in Theoretical Computer Science (ITCS ‘14), 411-422. ACM, 2014. Publisher's Version

Version History: Full version posted as arXiv:1401.4092 [cs.GT].

We prove new positive and negative results concerning the existence of truthful and individually rational mechanisms for purchasing private data from individuals with unbounded and sensitive privacy preferences. We strengthen the impossibility results of Ghosh and Roth (EC 2011) by extending them to a much wider class of privacy valuations. In particular, these include privacy valuations that are based on \((\varepsilon, \delta)\)-differentially private mechanisms for non-zero \(\delta\), ones where the privacy costs are measured in a per-database manner (rather than taking the worst case), and ones that do not depend on the payments made to players (which might not be observable to an adversary).

To bypass this impossibility result, we study a natural special setting where individuals have monotonic privacy valuations, which captures common contexts where certain values for private data are expected to lead to higher valuations for privacy (e.g. having a particular disease). We give new mechanisms that are individually rational for all players with monotonic privacy valuations, truthful for all players whose privacy valuations are not too large, and accurate if there are not too many players with too-large privacy valuations. We also prove matching lower bounds showing that in some respects our mechanism cannot be improved significantly.

ITCS2014.pdf ArXiv2018.pdf
Wood, Alexandra, David O'Brien, Micah Altman, Alan Karr, Urs Gasser, Michael Bar-Sinai, Kobbi Nissim, Jonathan Ullman, Salil Vadhan, and Michael Wojcik. “Integrating approaches to privacy across the research lifecycle: Long-term longitudinal studies.” Berkman Center Research Publication No. 2014-12, July 2014. Publisher's Version

Version History: Available at SSRN: http://ssrn.com/abstract=2469848.

On September 24-25, 2013, the Privacy Tools for Sharing Research Data project at Harvard University held a workshop titled "Integrating Approaches to Privacy across the Research Data Lifecycle." Over forty leading experts in computer science, statistics, law, policy, and social science research convened to discuss the state of the art in data privacy research. The resulting conversations centered on the emerging tools and approaches from the participants’ various disciplines and how they should be integrated in the context of real-world use cases that involve the management of confidential research data. 

This workshop report, the first in a series, provides an overview of the long-term longitudinal study use case. Long-term longitudinal studies collect, at multiple points over a long period of time, highly specific and often sensitive data describing the health, socioeconomic, or behavioral characteristics of human subjects. The value of such studies lies in part in their ability to link a set of behaviors and changes to each individual, but these factors tend to make the combination of observable characteristics associated with each subject unique and potentially identifiable.

Using the research information lifecycle as a framework, this report discusses the defining features of long-term longitudinal studies and the associated challenges for researchers tasked with collecting and analyzing such data while protecting the privacy of human subjects. It also describes the disclosure risks and common legal and technical approaches currently used to manage confidentiality in longitudinal data. Finally, it identifies urgent problems and areas for future research to advance the integration of various methods for preserving confidentiality in research data.

SSRN2014.pdf
2012
Thaler, Justin, Jonathan Ullman, and Salil Vadhan. “Faster algorithms for privately releasing marginals.” In Artur Czumaj, Kurt Mehlhorn, Andrew M. Pitts, and Roger Wattenhofer, editors, Proceedings of the 39th International Colloquium on Automata, Languages, and Programming (ICALP ‘12), Lecture Notes in Computer Science, 7391:810-821. Springer-Verlag, 2012. Publisher's Version

Version History: Full version posted as arXiv:1205.1758v2.

We study the problem of releasing \(k\)-way marginals of a database \(D \in (\{0,1\}^d)^n\), while preserving differential privacy. The answer to a \(k\)-way marginal query is the fraction of \(D\)’s records \(x \in \{0,1\}^d\) with a given value in each of a given set of up to \(k\) columns. Marginal queries enable a rich class of statistical analyses of a dataset, and designing efficient algorithms for privately releasing marginal queries has been identified as an important open problem in private data analysis (cf. Barak et al., PODS ’07).
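
As a point of reference, the following sketch shows the standard Laplace-mechanism baseline (not the paper's faster algorithm) for answering a single \(k\)-way marginal query with \(\varepsilon\)-differential privacy: the query has sensitivity \(1/n\), so adding Laplace noise of scale \(1/(\varepsilon n)\) suffices.

import numpy as np

def marginal_query(data, columns, values):
    """Fraction of rows of the 0/1 matrix `data` whose entries in `columns`
    equal the corresponding bits in `values` (a k-way marginal)."""
    mask = np.all(data[:, columns] == np.asarray(values), axis=1)
    return mask.mean()

def private_marginal(data, columns, values, eps, rng=None):
    """Laplace-mechanism answer: the true marginal plus Laplace(1/(eps*n)) noise.
    Answering each query separately like this is the naive baseline; the paper
    targets releasing one summary for all d^Theta(k) marginal queries efficiently."""
    rng = rng or np.random.default_rng()
    n = data.shape[0]
    return marginal_query(data, columns, values) + rng.laplace(scale=1.0 / (eps * n))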

We give an algorithm that runs in time \(d^{O(\sqrt{k})}\) and releases a private summary capable of answering any \(k\)-way marginal query with at most \(\pm .01\) error on every query as long as \(n \geq d^{O(\sqrt{k})}\). To our knowledge, ours is the first algorithm capable of privately releasing marginal queries with non-trivial worst-case accuracy guarantees in time substantially smaller than the number of \(k\)-way marginal queries, which is \(d^{\Theta(k)}\) (for \(k \ll d\)).

ArXiv2018.pdf ICALP2012.pdf
Dodis, Yevgeniy, Adriana López-Alt, Ilya Mironov, and Salil Vadhan. “Differential privacy with imperfect randomness.” In Ran Canetti and Reihaneh Safavi-Naini, editors, Proceedings of the 32nd International Cryptology Conference (CRYPTO ‘12), Lecture Notes in Computer Science, 7417:497-516. Springer-Verlag, 2012. Publisher's Version

In this work we revisit the question of basing cryptography on imperfect randomness. Bosley and Dodis (TCC’07) showed that if a source of randomness \(\mathcal{R}\) is “good enough” to generate a secret key capable of encrypting \(k\) bits, then one can deterministically extract nearly \(k\) almost uniform bits from \(\mathcal{R}\), suggesting that traditional privacy notions (namely, indistinguishability of encryption) require an “extractable” source of randomness. Other, even stronger impossibility results are known for achieving privacy under specific “non-extractable” sources of randomness, such as the \(\gamma\)-Santha-Vazirani (SV) source, where each next bit has fresh entropy, but is allowed to have a small bias \(\gamma < 1\) (possibly depending on prior bits).
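
For concreteness, here is a small Python sketch (ours) of a \(\gamma\)-Santha-Vazirani source: every bit carries fresh entropy, but an adversary may bias each bit by up to \(\gamma\) as a function of the bits emitted so far, which is what makes deterministic extraction impossible in general.

import random

def sv_source(num_bits, gamma, bias_strategy):
    """Emit bits with Pr[next bit = 1] in [(1 - gamma)/2, (1 + gamma)/2].
    `bias_strategy(prefix)` models the adversary: it returns a bias in
    [-gamma, gamma] that may depend on all previously emitted bits."""
    bits = []
    for _ in range(num_bits):
        b = max(-gamma, min(gamma, bias_strategy(bits)))  # clamp to [-gamma, gamma]
        p_one = 0.5 + b / 2                               # probability of emitting a 1
        bits.append(1 if random.random() < p_one else 0)
    return bits

# Example adversary: bias every bit toward the value of the previous one.
stream = sv_source(16, gamma=0.3,
                   bias_strategy=lambda prefix: 0.3 if (prefix and prefix[-1] == 1) else -0.3)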

We ask whether similar negative results also hold for a more recent notion of privacy called differential privacy (Dwork et al., TCC’06), concentrating, in particular, on achieving differential privacy with the Santha-Vazirani source. We show that the answer is no. Specifically, we give a differentially private mechanism for approximating arbitrary “low sensitivity” functions that works even with randomness coming from a \(\gamma\)-Santha-Vazirani source, for any \(\gamma < 1\). This provides a somewhat surprising “separation” between traditional privacy and differential privacy with respect to imperfect randomness.

Interestingly, the design of our mechanism is quite different from the traditional “additive-noise” mechanisms (e.g., Laplace mechanism) successfully utilized to achieve differential privacy with perfect randomness. Indeed, we show that any (non-trivial) “SV-robust” mechanism for our problem requires a demanding property called consistent sampling, which is strictly stronger than differential privacy, and cannot be satisfied by any additive-noise mechanism.

CRYPTO2012.pdf IACR2012.pdf
Dwork, Cynthia, Moni Naor, and Salil Vadhan. “The privacy of the analyst and the power of the state.” In Proceedings of the 53rd Annual IEEE Symposium on Foundations of Computer Science (FOCS ‘12), 400-409. IEEE, 2012. Publisher's Version
We initiate the study of privacy for the analyst in differentially private data analysis. That is, not only will we be concerned with ensuring differential privacy for the data (i.e. individuals or customers), which are the usual concern of differential privacy, but we also consider (differential) privacy for the set of queries posed by each data analyst. The goal is to achieve privacy with respect to other analysts, or users of the system. This problem arises only in the context of stateful privacy mechanisms, in which the responses to queries depend on other queries posed (a recent wave of results in the area utilized cleverly coordinated noise and state in order to allow privately answering a huge number of queries). We argue that the problem is real by proving an exponential gap between the number of queries that can be answered (with non-trivial error) by stateless and stateful differentially private mechanisms. We then give a stateful algorithm for differentially private data analysis that also ensures differential privacy for the analyst and can answer exponentially many queries.
IEEE2012.pdf
2010
Dwork, Cynthia, Guy Rothblum, and Salil Vadhan. “Boosting and differential privacy.” In Proceedings of the 51st Annual IEEE Symposium on Foundations of Computer Science (FOCS ‘10), 51-60. IEEE, 2010. Publisher's Version
Boosting is a general method for improving the accuracy of learning algorithms. We use boosting to construct improved privacy-preserving synopses of an input database. These are data structures that yield, for a given set \(\mathcal{Q}\) of queries over an input database, reasonably accurate estimates of the responses to every query in \(\mathcal{Q}\), even when the number of queries is much larger than the number of rows in the database. Given a base synopsis generator that takes a distribution on \(\mathcal{Q}\) and produces a "weak" synopsis that yields "good" answers for a majority of the weight in \(\mathcal{Q}\), our Boosting for Queries algorithm obtains a synopsis that is good for all of \(\mathcal{Q}\). We ensure privacy for the rows of the database, but the boosting is performed on the queries. We also provide the first synopsis generators for arbitrary sets of arbitrary low-sensitivity queries, i.e., queries whose answers do not vary much under the addition or deletion of a single row.

In the execution of our algorithm certain tasks, each incurring some privacy loss, are performed many times. To analyze the cumulative privacy loss, we obtain an \(O(\varepsilon^2)\) bound on the expected privacy loss from a single \(\varepsilon\)-differentially private mechanism. Combining this with evolution of confidence arguments from the literature, we get stronger bounds on the expected cumulative privacy loss due to multiple mechanisms, each of which provides \(\varepsilon\)-differential privacy or one of its relaxations, and each of which operates on (potentially) different, adaptively chosen, databases.
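
One standard way to state such an \(O(\varepsilon^2)\) bound, given here as a well-known consequence of \(\varepsilon\)-differential privacy rather than as the paper's exact lemma: if \(P\) and \(Q\) are the output distributions of an \(\varepsilon\)-differentially private mechanism on two adjacent databases, the privacy loss \(L(y) = \ln\big(P(y)/Q(y)\big)\) satisfies \(|L(y)| \leq \varepsilon\) pointwise, yet its expectation under \(P\) is only quadratic in \(\varepsilon\):

\[ \mathbb{E}_{y \sim P}\!\left[\ln \frac{P(y)}{Q(y)}\right] \;=\; D_{\mathrm{KL}}(P \,\|\, Q) \;\leq\; \varepsilon\,(e^{\varepsilon} - 1) \;=\; O(\varepsilon^{2}) \quad \text{as } \varepsilon \to 0. \]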
FOCS-IEEE 2010.pdf
McGregor, Andrew, Ilya Mironov, Toniann Pitassi, Omer Reingold, Kunal Talwar, and Salil Vadhan. “The limits of two-party differential privacy.” In Proceedings of the 51st Annual IEEE Symposium on Foundations of Computer Science (FOCS ‘10), 81-90. IEEE, 2010. Publisher's Version

Version History and Errata: Subsequent version published in ECCC 2011. Proposition 8 and Part (b) of Theorem 13 in the FOCS version are incorrect, and are removed from the ECCC version.

We study differential privacy in a distributed setting where two parties would like to perform analysis of their joint data while preserving privacy for both datasets. Our results imply almost tight lower bounds on the accuracy of such data analyses, both for specific natural functions (such as Hamming distance) and in general. Our bounds expose a sharp contrast between the two-party setting and the simpler client-server setting (where privacy guarantees are one-sided). In addition, those bounds demonstrate a dramatic gap between the accuracy that can be obtained by differentially private data analysis versus the accuracy obtainable when privacy is relaxed to a computational variant of differential privacy. The first proof technique we develop demonstrates a connection between differential privacy and deterministic extraction from Santha-Vazirani sources. A second connection we expose indicates that the ability to approximate a function by a low-error differentially private protocol is strongly related to the ability to approximate it by a low communication protocol. (The connection goes in both directions).

FOCS2010.pdf ECCC2011.pdf
2009
Dwork, Cynthia, Moni Naor, Omer Reingold, Guy Rothblum, and Salil Vadhan. “On the complexity of differentially private data release: Efficient algorithms and hardness results.” In Proceedings of the 41st Annual ACM Symposium on Theory of Computing (STOC ‘09), 381-390. ACM, 2009. Publisher's Version
We consider private data analysis in the setting in which a trusted and trustworthy curator, having obtained a large data set containing private information, releases to the public a "sanitization" of the data set that simultaneously protects the privacy of the individual contributors of data and offers utility to the data analyst. The sanitization may be in the form of an arbitrary data structure, accompanied by a computational procedure for determining approximate answers to queries on the original data set, or it may be a "synthetic data set" consisting of data items drawn from the same universe as items in the original data set; queries are carried out as if the synthetic data set were the actual input. In either case the process is non-interactive; once the sanitization has been released the original data and the curator play no further role. For the task of sanitizing with a synthetic dataset output, we map the boundary between computational feasibility and infeasibility with respect to a variety of utility measures. For the (potentially easier) task of sanitizing with unrestricted output format, we show a tight qualitative and quantitative connection between hardness of sanitizing and the existence of traitor tracing schemes.
STOC2009.pdf
Mironov, Ilya, Omkant Pandey, Omer Reingold, and Salil Vadhan. “Computational differential privacy.” In S. Halevi, editor, Advances in Cryptology—CRYPTO ‘09, Lecture Notes in Computer Science, 5677:126-142. Springer-Verlag, 2009. Publisher's Version

The definition of differential privacy has recently emerged as a leading standard of privacy guarantees for algorithms on statistical databases. We offer several relaxations of the definition which require privacy guarantees to hold only against efficient—i.e., computationally bounded—adversaries. We establish various relationships among these notions, and in doing so, we observe their close connection with the theory of pseudodense sets by Reingold et al. [1]. We extend the dense model theorem of Reingold et al. to demonstrate equivalence between two definitions (indistinguishability- and simulatability-based) of computational differential privacy.

Our computational analogues of differential privacy seem to allow for more accurate constructions than the standard information-theoretic analogues. In particular, in the context of private approximation of the distance between two vectors, we present a differentially-private protocol for computing the approximation, and contrast it with a substantially more accurate protocol that is only computationally differentially private.

CRYPTO2009.pdf
