# Publications by Year: 2015

**Version History**: Available at SSRN: http://ssrn.com/abstract=2586158.

On September 24-25, 2013, the Privacy Tools for Sharing Research Data project at Harvard University held a workshop titled "Integrating Approaches to Privacy across the Research Data Lifecycle." Over forty leading experts in computer science, statistics, law, policy, and social science research convened to discuss the state of the art in data privacy research. The resulting conversations centered on the emerging tools and approaches from the participants’ various disciplines and how they should be integrated in the context of real-world use cases that involve the management of confidential research data.

Researchers are increasingly obtaining data from social networking websites, publicly-placed sensors, government records and other public sources. Much of this information appears public, at least to first impressions, and it is capable of being used in research for a wide variety of purposes with seemingly minimal legal restrictions. The insights about human behaviors we may gain from research that uses this data are promising. However, members of the research community are questioning the ethics of these practices, and at the heart of the matter are some difficult questions about the boundaries between public and private information. This workshop report, the second in a series, identifies selected questions and explores issues around the meaning of “public” in the context of using data about individuals for research purposes.

**Version History**: Full version posted as arXiv:1504.07553.

We prove new upper and lower bounds on the sample complexity of \((\varepsilon, \delta)\) differentially private algorithms for releasing approximate answers to threshold functions. A threshold function \(c_x\) over a totally ordered domain \(X\) evaluates to \(c_x(y)=1 \) if \(y \leq {x}\), and evaluates to \(0\) otherwise. We give the first nontrivial lower bound for releasing thresholds with \((\varepsilon, \delta)\) differential privacy, showing that the task is impossible over an infinite domain \(X\), and moreover requires sample complexity \(n \geq \Omega(\log^* |X|)\), which grows with the size of the domain. Inspired by the techniques used to prove this lower bound, we give an algorithm for releasing thresholds with \(n ≤ 2^{(1+o(1)) \log^∗|X|}\) samples. This improves the previous best upper bound of \(8^{(1+o(1)) \log^∗ |X|}\)(Beimel et al., RANDOM ’13).

Our sample complexity upper and lower bounds also apply to the tasks of learning distri- butions with respect to Kolmogorov distance and of properly PAC learning thresholds with differential privacy. The lower bound gives the first separation between the sample complexity of properly learning a concept class with \((\varepsilon, \delta)\) differential privacy and learning without privacy. For properly learning thresholds in \(\ell\) dimensions, this lower bound extends to \(n ≥ Ω(\ell·\log^∗ |X|)\).

To obtain our results, we give reductions in both directions from releasing and properly learning thresholds and the simpler *interior point problem*. Given a database \(D\) of elements from \(X\), the interior point problem asks for an element between the smallest and largest elements in \(D\). We introduce new recursive constructions for bounding the sample complexity of the interior point problem, as well as further reductions and techniques for proving impossibility results for other basic problems in differential privacy.

For Boolean functions computed by read-once, depth-D circuits with unbounded fan-in over the de Morgan basis, we present an explicit pseudorandom generator with seed length \(\tilde{O}(\log^{D+1} n)\). The previous best seed length known for this model was \(\tilde{O}(\log^{D+4} n)\), obtained by Trevisan and Xue (CCC ‘13) for all of AC0 (not just read-once). Our work makes use of Fourier analytic techniques for pseudorandomness introduced by Reingold, Steinke, and Vadhan (RANDOM ‘13) to show that the generator of Gopalan et al. (FOCS ‘12) fools read-once AC0. To this end, we prove a new Fourier growth bound for read-once circuits, namely that for every \(F : \{0,1\}^n\rightarrow \{0,1\}\) computed by a read-once, depth-\(D\) circuit,

\(\left|\hat{F}[s]\right| \leq O\left(\log^{D-1} n\right)^k,\)

where \(\hat{F}\) denotes the Fourier transform of \(F\) over \(\mathbb{Z}_2^n\).

The privacy risks inherent in the release of a large number of summary statistics were illustrated by Homer *et al.* (*PLoS Genetics*, 2008), who considered the case of 1-way marginals of SNP allele frequencies obtained in a genome-wide association study: Given a large number of minor allele frequencies from a case group of individuals diagnosed with a particular disease, together with the genomic data of a single target individual and statistics from a sizable reference dataset independently drawn from the same population, an attacker can determine with high confidence whether or not the target is in the case group.

In this work we describe and analyze a simple attack that succeeds even if the summary statistics are significantly distorted, whether due to measurement error or noise intentionally introduced to protect privacy. Our attack only requires that the vector of distorted summary statistics is close to the vector of true marginals in \(\ell_1 \)norm. Moreover, the reference pool required by previous attacks can be replaced by *a single sample* drawn from the underlying population.

The new attack, which is not specific to genomics and which handles Gaussian as well as Bernouilli data, significantly generalizes recent lower bounds on the noise needed to ensure differential privacy (Bun, Ullman, and Vadhan, STOC 2014; Steinke and Ullman, 2015), obviating the need for the attacker to control the exact distribution of the data.