Data confidentiality, privacy, & security

Researcher(s): Stephen Fienberg

Research Area: Protecting Privacy & Confidentiality of Information

Abstract

The proposed research will continue our work in shared statistical computation in two areas: (1) Secure Statistical Computations with Distributed Databases and (2) Data Compression and Formulations of Privacy Based on Statistical Decision Theory.

Secure Statistical Computations with Distributed Databases: The work attempts to combine two notions of privacy in database calculations: privacy for the owners of the databases (coming largely from the computer science domain) and individual privacy for those whose data are contained therein (coming from the statistics domain). This research builds on earlier efforts involving record linkage and disclosure limitation methodology. It will include empirical implementation of these methods using several very large distributed databases, and it will extend the work to incorporate measurement error and secure record linkage. We will also compare the secure regression and logistic regression methods with "unsecure" approaches whose calculations rely on sharing compressed data and other cryptographically protected forms of data.
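
To make the idea of secure regression across database owners concrete, the following is a minimal sketch, not the project's actual protocol. It assumes horizontally partitioned data (each owner holds different records with the same variables), a single predictor, and honest participants, and it uses additive secret sharing over a prime modulus to pool the sufficient statistics. All names (`make_shares`, `secure_sum`, `secure_simple_regression`) and the constants `PRIME` and `SCALE` are hypothetical choices made for illustration.

```python
import secrets

PRIME = 2**127 - 1   # modulus for additive shares (a Mersenne prime)
SCALE = 10**6        # fixed-point scale for encoding real-valued statistics


def encode(value):
    """Map a real number to a fixed-point residue modulo PRIME."""
    return round(value * SCALE) % PRIME


def decode(residue):
    """Invert encode(), treating residues above PRIME // 2 as negative."""
    if residue > PRIME // 2:
        residue -= PRIME
    return residue / SCALE


def make_shares(value, n_parties):
    """Split an encoded value into n_parties additive shares mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((encode(value) - sum(shares)) % PRIME)
    return shares


def local_statistics(xs, ys):
    """Sufficient statistics for simple linear regression on one party's rows."""
    return [
        len(xs),
        sum(xs),
        sum(ys),
        sum(x * x for x in xs),
        sum(x * y for x, y in zip(xs, ys)),
    ]


def secure_sum(per_party_values):
    """Sum one statistic across parties; each party sees only random shares."""
    n = len(per_party_values)
    # Row i holds the shares that party i hands out (one per recipient).
    dealt = [make_shares(v, n) for v in per_party_values]
    # Each recipient j adds the shares it received and publishes the partial sum.
    partial = [sum(dealt[i][j] for i in range(n)) % PRIME for j in range(n)]
    return decode(sum(partial) % PRIME)


def secure_simple_regression(parties):
    """Fit y = a + b*x from pooled statistics without pooling the raw rows."""
    stats = [local_statistics(xs, ys) for xs, ys in parties]
    n, sx, sy, sxx, sxy = (secure_sum([s[k] for s in stats]) for k in range(5))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Three hypothetical database owners, each holding a few (x, y) rows.
parties = [
    ([1.0, 2.0], [3.1, 4.9]),
    ([3.0, 4.0, 5.0], [7.2, 8.8, 11.1]),
    ([6.0], [13.0]),
]
intercept, slope = secure_simple_regression(parties)
```

In this sketch only the pooled totals are ever reconstructed, which addresses the owners' privacy; it does not by itself say anything about the individual privacy of the people in the records, which is the second notion the paragraph above sets out to combine with it.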

Data Compression and Formulations of Privacy Based on Statistical Decision Theory: The compressed regression formulation can be viewed as a form of matrix masking which combines variables in order to prevent individual identification through unique combinations of values. This research will investigate a novel approach for assessing the trade-off between privacy and utility, using statistical decision theory tools. Suppose we plan to release a database. What level of protection do we need to apply to the database to prevent adversaries' attempts to compromise privacy while still enabling legitimate users to perform useful statistical analysis? Adopting minimax theory for assessing the "utility" of the released database, we are working to develop a rigorous, probabilistic definition of privacy that attempts to protect a large fraction of individuals in the database with high probability.
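
As an illustration of regression on masked data, here is a minimal sketch of one variant of matrix masking. It assumes the data are multiplied on the left by a secret random Gaussian matrix, so that each released row is a random linear combination of the original rows; this choice, the single-predictor model, and the function names (`gaussian_matrix`, `mask`, `compressed_fit`) are assumptions made for illustration and are not necessarily the formulation used in this project.

```python
import random


def gaussian_matrix(m, n, rng):
    """An m x n masking matrix with independent standard normal entries."""
    return [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]


def mask(phi, column):
    """Left-multiply one data column by phi, mixing records together."""
    return [sum(p * v for p, v in zip(phi_row, column)) for phi_row in phi]


def least_squares_2col(u, v, t):
    """Solve min ||a*u + b*v - t||^2 through the 2 x 2 normal equations."""
    uu = sum(a * a for a in u)
    uv = sum(a * b for a, b in zip(u, v))
    vv = sum(b * b for b in v)
    ut = sum(a * c for a, c in zip(u, t))
    vt = sum(b * c for b, c in zip(v, t))
    det = uu * vv - uv * uv
    return (ut * vv - uv * vt) / det, (uu * vt - uv * ut) / det


def compressed_fit(xs, ys, m, rng):
    """Fit y = a + b*x using only m masked rows of the columns [1, x, y]."""
    phi = gaussian_matrix(m, len(xs), rng)
    ones = [1.0] * len(xs)
    # Only these three masked columns would be released, never xs or ys.
    return least_squares_2col(mask(phi, ones), mask(phi, xs), mask(phi, ys))


# Hypothetical data: 50 records compressed down to 10 masked rows.
rng = random.Random(0)
xs = [i / 10 for i in range(50)]
ys = [2.0 + 3.0 * x + rng.gauss(0.0, 0.1) for x in xs]
intercept, slope = compressed_fit(xs, ys, m=10, rng=rng)
```

Because the linear model survives the masking (the masked response equals the same coefficients applied to the masked predictors, plus masked noise), a legitimate user can still estimate the regression from the released rows. How much is lost by choosing a smaller `m`, and how much an adversary can still learn, is the kind of privacy and utility trade-off the paragraph above proposes to study.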