posted by Richard Power
Anupam Datta is an Associate Professor at Carnegie Mellon University (CMU) and a leading researcher at CMU CyLab. CyLab Chronicles recently sat down with him to discuss his team's latest successes and the future direction of their vital research. Here is our conversation.
In the 21st century, privacy has become a very challenging space, fraught with complex issues, for government, for business, and for each of us as individuals. Which particular issue does the work in "Bootstrapping Privacy Compliance in Big Data Systems" seek to address?
Tell us something about the system you and your team designed, and briefly describe its components, Legalease and Grok.
Datta: Central to the design of the workflow and tool chain are (a) *Legalease* — a language that allows specification of privacy policies that impose restrictions on how user data flows through software systems; and (b) *Grok* — a data inventory that annotates big data software systems (written in the Map-Reduce programming model) with Legalease’s policy datatypes, thus enabling automated compliance checking. Let me elaborate.
Software systems that perform data analytics over personal information of users are often written without a technical connection to the privacy policies that they are meant to respect. Tens of millions of lines of such code are already in place in companies like Facebook, Google, and Microsoft. An important challenge is thus to *bootstrap* existing software for privacy compliance. Grok addresses this challenge. It annotates software written in programming languages that support the Map-Reduce programming model (e.g., Dremel, Hive, Scope) with Legalease’s policy datatypes. We focus on this class of languages because they are the languages of choice in industry for writing data analytics code. A simple way to conduct the bootstrapping process would be to ask developers to manually annotate all code with policy datatypes (e.g., labelling variables as IPAddress, programs as being for the purpose of Advertising etc.). However, this process is too labor-intensive to scale. Instead, we develop a set of techniques to automate the bootstrapping process. I should add that the development of Grok was led by Microsoft Research and was underway before our collaboration with them began.
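To make the idea concrete, here is a minimal sketch of what automated compliance checking over an annotated data inventory could look like. It is an illustration only, not Legalease syntax or Grok's actual implementation: the tuple-based policy clauses, the dictionary-based inventory, the job names (`query_logs`, `ad_ranking`, `abuse_detection`), the `AbuseDetection` purpose label, and both functions are hypothetical. Only the labels `IPAddress` and `Advertising` come from the interview. The sketch assumes each job has already been annotated with the datatypes it introduces and the purpose it serves, and it reports every job whose purpose is denied a datatype that can reach it.

```python
# Toy sketch: check a "deny datatype for purpose" policy against an
# annotated data inventory. All names and structures are hypothetical.

def reaching_datatypes(inventory, name, seen=None):
    """Return every datatype that can flow into job `name`, directly or
    through its upstream inputs."""
    seen = set() if seen is None else seen
    if name in seen:  # already visited; also guards against cycles
        return set()
    seen.add(name)
    node = inventory[name]
    types = set(node["datatypes"])
    for upstream in node["inputs"]:
        types |= reaching_datatypes(inventory, upstream, seen)
    return types


def violations(policy, inventory):
    """Return sorted (job, datatype, purpose) triples that break the policy."""
    found = []
    for name, node in inventory.items():
        reaching = reaching_datatypes(inventory, name)
        for datatype, purpose in policy:
            if node["purpose"] == purpose and datatype in reaching:
                found.append((name, datatype, purpose))
    return sorted(found)


# Hypothetical policy clause: IP addresses must not be used for advertising.
policy = [("IPAddress", "Advertising")]

# Hypothetical inventory: each job lists the datatypes it introduces,
# its purpose label (if any), and the jobs it reads from.
inventory = {
    "query_logs": {
        "purpose": None,
        "datatypes": {"IPAddress", "SearchQuery"},
        "inputs": [],
    },
    "ad_ranking": {
        "purpose": "Advertising",
        "datatypes": set(),
        "inputs": ["query_logs"],
    },
    "abuse_detection": {
        "purpose": "AbuseDetection",
        "datatypes": set(),
        "inputs": ["query_logs"],
    },
}

result = violations(policy, inventory)
print(result)
```

In this toy inventory, `ad_ranking` is flagged because it is labelled as Advertising and reads from logs that carry IP addresses, while `abuse_detection` reads the same logs without violating the clause. The hard part described in the interview, inferring such labels automatically across millions of lines of existing code, is exactly what this sketch takes as given.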
What do you see as the practical outcome of this research? Is it going to be applicable in the operational environment? What is it leading toward?
Datta: One practical outcome of this research is that it enables companies to check that their big data software systems are compliant with their privacy promises. Indeed, the prototype is already running on the production system for Bing, Microsoft's search engine. So, yes, it is applicable to current operational environments.
I view this result as a significant step forward in ensuring privacy compliance at scale. With the advent of technology that can help companies keep their privacy promises, users can reasonably expect stronger privacy promises from companies that operate over their personal information. Of course, much work remains to be done in expanding this kind of compliance technology to cover a broader set of privacy policies and in its adoption in a wide range of organizations.
What other privacy challenges are you looking to address in your ongoing research? What's next for your team?
Datta: We are deeply examining the Web advertising ecosystem, in particular, its *transparency* and its implications for *privacy* and *digital discrimination*. There are important computational questions lurking behind this ecosystem: Are web-advertising systems transparently explaining what information they collect and how they use that information to serve targeted ads? Are these systems compliant with promises they make that they won't use certain types of information (e.g., race, health information, sexual orientation) for advertising? Can we answer these questions from the "outside", i.e., without gaining access to the source code and data of web advertising systems (in contrast with our privacy compliance work with Microsoft Research)? How can we enhance transparency of the current Web advertising ecosystem? What does digital discrimination mean in computational terms and how can we design systems that avoid it?
As a researcher who has looked long and deeply into the privacy space, what do you think is most lacking in the approaches of government and business? What is most needed? What are people missing?
Datta: First, I believe that there is a pressing need for better *computational tools* that organizations can (and should) use to ensure that their software systems and people are compliant with their privacy promises. These tools have to be based on well-founded computing principles and at the same time have to be usable by the target audience, which may include lawyers and other groups who are not computer scientists.
Second, current policies about collection, use, and disclosure of personal information often make rather *weak promises*. This is, in part, a corollary of my first point: organizations do not want to make promises when they lack the tools to help them comply. But it is also driven by economic considerations, e.g., extensive use of users' personal information for advertising can result in higher click-through rates. While many companies promise that they will not use certain types of information (e.g., race, health information, sexual orientation) for advertising, it is far from clear what these promises even mean and how a company can demonstrate that it is compliant with them.
Third, we need better *transparency mechanisms* from organizations that explain what information they collect and how they use that information. Self-declared privacy policies and mechanisms like ad settings managers are steps in the right direction. However, the transparency mechanisms should themselves be auditable to ensure that they indeed transparently reflect the information handling practices of the organization.
These are all fascinating questions -- intellectually deep from a computer science standpoint and hugely relevant to the lives of hundreds of millions of people around the world. They keep me and my research group up at night!