19 January 2021

New professor investigates differential privacy techniques to protect sensitive data

Privacy

How do we develop solutions that enable useful data analyses without invading privacy? Is it possible to share data on populations that cannot be traced back to individuals? These are some of the mathematical problems that the Department of Computer Science's new professor, Rasmus Pagh, deals with.

Professor Rasmus Pagh

Today, huge amounts of data are being collected throughout our society, containing valuable information that can be used to improve a wide range of areas. Through data analyses, we can develop artificial intelligence systems for healthcare and anticipate changes in the climate and the economy, and companies can improve their products and services.

However, since much of the data about people is personally sensitive, it is important to develop methods that enable useful analyses without violating the privacy of citizens. These are the methods that the Department of Computer Science's new professor, Rasmus Pagh, develops.

Among other things, Rasmus researches differential privacy – in his own words “a lightweight cryptography” that makes it possible to analyse data that is potentially sensitive without having to fear that the analyses leak personally sensitive information.

- It is crucial for a data-driven world that people feel safe sharing data. Unfortunately, we have seen several cases where companies and states have published data analyses that seemed harmless, but which could actually be used to find private information when combined with other sources. Within differential privacy, we try to find solutions to this problem, says Rasmus.

According to Rasmus, institutions and companies are usually not interested in information about the behaviour of individual users, customers, or citizens. Instead, they are interested in a summary of data that shows useful patterns for a larger group.

- Therefore, the goal of differential privacy is to be able to release as much information as possible without revealing anything about individuals. We want to provide information about a population of some sort and, at the same time, make sure that you cannot use this data to gain knowledge about individuals. In many cases, this is indeed possible, says Rasmus.

Visiting faculty researcher at Google

Rasmus designs and analyses so-called randomized algorithms, for which he is internationally recognized. These are algorithms that make random choices along the way and can be used to ensure privacy.

Rasmus has primarily dealt with the theoretical part of algorithm design but he is also interested in how algorithms can be used in practice. In 2019, Rasmus was a visiting faculty researcher at Google, working on space-efficient machine learning and differential privacy.

- It was exciting to visit Google's research department where theoretical ideas often end up being used on a large scale. I like to move back and forth between, on the one hand, the theoretical problems and, on the other hand, the impact that the research can have outside the walls of the university, says Rasmus.

Differential privacy is of great value to Big Tech companies, which are often criticised for their massive data collection. Other Big Tech companies like Apple and Facebook are also working with differential privacy, and according to Rasmus, the techniques will also become relevant to public institutions in the future.

- In the long run, it will make sense to implement this in public institutions that are also in possession of a lot of useful data. This is already done today by the US Census. Here, you need to be sure that what you tell the outside world does not violate anyone's privacy.

Protecting sensitive data with differential privacy - how it works

For many years, institutions and companies have anonymised personal data through ad-hoc methods, for example when publishing large population surveys. This means that institutions have chosen what information to hide based on the individual study. The problem with this method is that, in many cases, you will be able to combine information from different sources to find personally sensitive information that you should not be able to find.

Imagine that the University of Copenhagen chooses to provide data about the average salary per employee and updates it every time a new employee is hired. If you know when a person started working at the university, you will be able to calculate what he or she earns, based on how much the average salary increased or decreased.
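The arithmetic behind this salary example can be sketched in a few lines of Python. The figures are made up for illustration, and the sketch assumes the observer also knows how many employees there were before the hire, a detail the example leaves implicit.

```python
def average(salaries):
    """Exact average, as it would be published without any protection."""
    return sum(salaries) / len(salaries)


def infer_new_salary(avg_before, n_before, avg_after):
    """Recover a new hire's salary from two exact published averages.

    Assumes the observer knows the headcount before the hire (n_before).
    total_after - total_before = salary of the new employee.
    """
    return avg_after * (n_before + 1) - avg_before * n_before


# Hypothetical salaries, not real data.
salaries = [40000, 45000, 50000]
avg_before = average(salaries)            # published before the hire

salaries_after = salaries + [61000]       # a new employee joins
avg_after = average(salaries_after)       # published after the hire

recovered = infer_new_salary(avg_before, len(salaries), avg_after)
print(recovered)
```

Two statistics that each look harmless are enough to pin down one person's salary, which is why hiding names alone does not protect anyone here.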

Differential privacy avoids publishing exact statistics. Instead of publishing the average salary x, you can, for example, publish x + s, where s is a randomly chosen number, a kind of "noise" that is added to x. Every time new statistics are published, new noise is introduced, which makes the method robust to analyses such as the one described above.
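As a rough illustration of publishing x + s, the sketch below draws s from a Laplace distribution, a common choice in the differential privacy literature but not one the article itself specifies. The salary cap, the privacy parameter epsilon, and the function names are assumptions made for the example, and the noise scale assumes the number of employees is public.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise.

    The difference of two independent exponential variables with mean
    `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_average(salaries, cap, epsilon, rng):
    """Return x + s: the clipped average plus fresh random noise.

    cap     -- assumed upper bound on any salary (hypothetical parameter)
    epsilon -- assumed privacy parameter; smaller means more noise
    """
    # Clip each salary to [0, cap] so one person can shift the
    # average by at most cap / n.
    clipped = [min(max(s, 0.0), cap) for s in salaries]
    n = len(clipped)
    x = sum(clipped) / n
    s = laplace_noise((cap / n) / epsilon, rng)
    return x + s


# Illustration only: a real deployment would need a carefully vetted
# source of randomness rather than Python's default generator.
rng = random.Random(0)
salaries = [40000, 45000, 50000]  # hypothetical salaries
first = noisy_average(salaries, cap=100000, epsilon=1.0, rng=rng)
second = noisy_average(salaries, cap=100000, epsilon=1.0, rng=rng)
print(first, second)  # two releases of the same data differ
```

With only three salaries the noise is large compared with the average. Because the scale is cap / (n * epsilon), it shrinks as the population grows, so statistics about large groups stay useful while individual contributions are masked.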